Introduction

The purpose of this proposal is to provide insight into gene-environment interactions. It leverages the simplified genetics and detailed
records of the military working dog population. There are several critical aspects to meeting the aims of this proposal: 1) development
of data-driven selection criteria, 2) biological sampling of representative dogs, and 3) generation of mathematical methodologies
capable of handling heterogeneous data and statistical tests in a consistent manner and providing clear and understandable results that are
biologically valid. Here we provide a breakdown of the previous year’s work and document our progress towards achieving the
specific aims we proposed.
Body


Task  1-  Regulatory  Approval: 

i) Cooperative Research And Development Agreements (CRADAs): Both the data and biological CRADAs
between Nationwide Children’s Hospital (NCHRI; Alvarez, Lead PI, home institution)/OSU (Huang and Couto,
Partnering PIs) and DoD/USA were executed by 2013.

ii) Animal use approval (Institutional Animal Care and Use Committee, IACUC): In 2012, the animal hospital at Lackland
AFB received the AAALAC accreditation that is mandatory for military IACUC approvals. In 2013, we
submitted final revisions of our IACUC protocol for the collection of biological samples. Lackland veterinary
approval and final Lackland AFB oversight approval were both granted, and those documents were
submitted to DoD CDMRP grant administration. One final approval, from ACURO, is currently pending
(and expected, according to their original anticipated timeline, within ~1 month), at which time biological sample
collection can be initiated.


Task 2- Data Capture of Veterinary Records: By having Ms. Michelle Perez, Veterinary Technician, embedded in the
military dog health service at Lackland AFB, we have been acquiring clinical and associated data from military dogs. This
was made possible by the execution of the CRADAs in 2013 (Task 1). The veterinary clinical cancer and medical records
expertise was provided by Dr. Couto. We have been using that data in two parallel tracks.

(i) In the first track, we have been using data forms to create advanced methods for capturing paper-based data and
converting it to electronic data (which is classified as either raw or manually confirmed to accurately represent the
original), using custom form versions of ABBYY software. The technical work was initiated before we had CRADAs in place
to use it on real DoD military dog health records. In 2013, Mr. Terry Camerlengo and his subsequent replacement Mr. Jacob
Aaronson (under supervision of Drs. Alvarez and Huang) worked with actual military dog health records (scanned by Vet.
Tech. Ms. Perez at Lackland AFB) to create those custom electronic versions of paper forms. Specifically, they initiated
the development of custom scanning and data capture from DoD military dog health record form 1829 (which is generated for
each health visit, providing longitudinal data) and from AFIP/JPC pathology reports (which are generated for essentially
all diagnostic cancer biopsies and sometimes for necropsy). That required significant effort from ABBYY support and
Research IT, NCHRI, to implement. This effort is ongoing. If one or both final customized forms are successful in the near
future, we will be able to scan any future records and automatically isolate each 1829 form and pathology report.
Importantly, we would also be able to process the many prioritized full records already scanned and archived in our
database in track ii.

(ii) In the second track, which was initiated in 2012 and continued through 2013, we have used different indicators to
prioritize individual dogs that are particularly important to our study and have begun scanning their complete records
(except for some associated clinical test data that could not be scanned, e.g., EKGs on thin perforated paper, which
would have risked destruction in our portable automatic-feed scanner). We are mainly focused on dogs that have had cancer,
or that are old enough that they most likely would have had it by now if they were at high risk. We thus acquired a list
of all Lackland AFB dog health records for which there are AFIP/JPC pathology reports. This was made possible by our
primary military dog program contact, LTC Cyle Richard. He provided us that list, which he received from AFIP/JPC; in this
way, we did not have to review thousands of records to identify those that contained pathology reports or cancer
diagnoses. This in turn allowed us to examine the pedigrees of DoD military dog puppy program dogs (DoD-bred dogs, as
opposed to purchased dogs) for selection of affected and unaffected littermates or half siblings. From this analysis we
identified a relatively small number of popular breeders that had many litters with different partners.


Task 3- Methodology Development:

Task 3 has advanced about as far as the data types we have acquired to date allow. Once final IACUC approval is granted
(expected within the month) and we begin to acquire military dog samples, we expect to be able to deploy the
methodologies we have developed. Specifically, we have validated the principal new methods using data from
previously-acquired Greyhound osteosarcoma case and control samples, and from data published by the LUPA Consortium
(Vaysse A et al. 2011. Identification of genomic regions associated with phenotypic variation between dog breeds using
selection mapping. PLoS Genet. 7(10):e1002316. PubMed PMID: 22022279).

In the first year’s Annual Report, we included two manuscripts (Rybaczyk et al. and Rowell et al.) that used a new
methodology we developed under the present program. Both manuscripts were submitted for publication in leading
genetics journals, and we have been addressing the reviewers’ criticisms and advice. Throughout 2013, we continued to refine
and validate those studies. Specifically, this work involves the invention of entirely novel techniques to conduct
genome-wide association analysis, or GWAS (Balding 2006), and multidimensional statistical analysis: Intersection Union
Testing (IUT).
The original focus of these works was on development of the IUT. In the course of improving the methods to address
reviewer comments during this reporting year, we determined that the integration of Bootstrapping with IUT is a major
innovation and advantage (Fig. 1). The greatest concern about our manuscripts was that the IUT method does not generate
conventional measures of statistical significance (p-values), despite the fact that the method empirically ranked
IUT-“significant” hits correctly (according to detection of true positives in published datasets). [Notably, that issue
is the major focus of applications of IUT to biology and high throughput gene expression data. Some have proposed solving
it using Bayesian approaches, but after many years, no one has had success doing so.] By adding Bootstrapping upstream of
IUT, we are able to give another type of measure of robustness of results - a confidence (vs. significance) measure
(Bootstrap Confidence Value, BCV).

In this reporting year we discovered strong evidence that our method is very sensitive and specific, based on
analysis of the genetic contributions to the complex trait of dog size as a test (using the Vaysse et al. dataset cited above).
Specifically, we reanalyzed that published data and not only identified those authors’ two genome-wide significant hits
found using conventional methods, but also found additional IUT-genome-wide significant hits that they missed (but which
have been shown to be true positives in other canine genetics studies). We also generated new evidence that
Bootstrap/IUT methods i) have increased ability to detect weak signal (a critical need for complex genetics such as cancer
risk) and ii) do not require correction for population structure when the analysis is designed properly. We did this by
analyzing, as a test, the most complex dog trait reported by Vaysse et al. (ref. above) - sociability (the response of a dog
when approached by another dog or a human). Experimental support for these claims was provided in figures within
the Q7 and Q8 Quarterly Reports.

I (Alvarez) have personally presented this methodology and our results to many investigators at international
meetings. For example, at the premier Gordon Conference of Human Genetics and Genomics (RI, 2013), I presented the
methodology to several experts in diverse areas of genetics/genomics. That included arguably the most important
population geneticist in the field of dog genetics, Dr. Carlos Bustamante (Stanford U.). He was intrigued by the method
(specifically enquiring about its ability to handle population structure, which we address in our revised manuscript) and he
was equally enthusiastic about the present project studying genetics in the military dogs. He expressed interest in
further discussion and in collaborating. I also presented this work to several investigators at the 7th International
Conference on Advances in Canine and Feline Genomics and Inherited Diseases (Broad Inst., Cambridge, MA, 2013).



Figure 1. Schematic of integrated Bootstrapping and Intersection Union Testing
(IUT) for genetic analysis. (A) The schematic on the left shows how a single dataset is
repeatedly subsampled (with replacement) and each subsample of cases and controls is
then put through the IUT compound hypothesis: i) for each subset of an IUT group, determine
which genetic markers have statistically significant frequency differences between cases and
controls; ii) keep only the markers that are significant in all subsets of an IUT group (thus
not requiring multiple testing correction). Right-hand notations compare our methods, which are
considered hypothesis tests, to analogous approaches in the field of Machine Learning,
which are not considered hypothesis tests but rather learning or predicting. (B) Illustration
of how IUT works in the first panel: each marker (SNP) is tested for significance in each
subset of an IUT group (sets #1, 2, 3); only those significant in all are kept. The second panel
illustrates how repeating the process on 1000 Bootstrap replicates (4 shown) can be used
to plot the proportion of times a marker is positive in the 1000 (scree plot, third panel).
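
The procedure summarized in Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the manuscripts: the genotype encoding (0/1/2 allele counts per marker), the pooled two-proportion z-test, the random partition of cases and controls into IUT subsets, and all function names are hypothetical choices made for the example.

```python
import math
import random


def two_proportion_p(k1, n1, k2, n2):
    """Two-sided p-value for an allele-frequency difference (pooled z-test).

    Assumption: a simple z-test stands in for whatever marker test is really used.
    """
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def split(items, k, rng):
    """Randomly partition items into k disjoint subsets (assumed IUT grouping)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def iut_hits(cases, controls, n_subsets, alpha, rng):
    """Return the markers that are significant in every subset of the IUT group."""
    n_markers = len(cases[0])
    hits = set(range(n_markers))
    subsets = zip(split(cases, n_subsets, rng), split(controls, n_subsets, rng))
    for case_sub, ctrl_sub in subsets:
        significant = set()
        for m in range(n_markers):
            k1 = sum(g[m] for g in case_sub)  # alternate-allele count in cases
            k2 = sum(g[m] for g in ctrl_sub)  # alternate-allele count in controls
            p = two_proportion_p(k1, 2 * len(case_sub), k2, 2 * len(ctrl_sub))
            if p < alpha:
                significant.add(m)
        hits &= significant  # intersection: keep only markers significant in all subsets
    return hits


def bootstrap_confidence(cases, controls, n_boot=1000, n_subsets=3, alpha=0.05, seed=0):
    """Proportion of bootstrap replicates in which each marker passes the IUT (BCV)."""
    rng = random.Random(seed)
    counts = [0] * len(cases[0])
    for _ in range(n_boot):
        boot_cases = rng.choices(cases, k=len(cases))  # resample with replacement
        boot_controls = rng.choices(controls, k=len(controls))
        for m in iut_hits(boot_cases, boot_controls, n_subsets, alpha, rng):
            counts[m] += 1
    return [c / n_boot for c in counts]
```

Sorting the returned proportions in decreasing order gives the kind of ranked plot described for the third panel of Figure 1(B).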


Most importantly, I presented this to one of the top two canine geneticists in the world, Dr. Kerstin Lindblad-Toh, Broad
Inst. (Harvard U./MIT) and Karolinska Inst., and a leader of the LUPA consortium in Europe. She and the leading fellow
in her group were both very interested in our methods and results, and we agreed to initiate collaborations. I also
presented this work to Dr. Adam Boyko (Cornell U.), who is the fastest-rising investigator in the field of canine genetics.
He was very interested, particularly in our success with the very complex trait of sociability in the Vaysse et al. dataset.

We  are  making  the  final  revisions  to  resubmit  the  Rybaczyk  et  al.  and  Rowell  et  al.  manuscripts.  The  former  is 
only  a  matter  of  finalizing  the  writing  and  adjusting  the  figures  (all  analyses  are  complete).  The  latter  may  be  complete, 
but  the  new  collaboration  with  Dr.  Lindblad-Toh  (see  above)  has  presented  us  with  a  second  independent  osteosarcoma 
positive  and  negative  cohort  of  genotyped  dogs  of  the  same  breed.  We  may  therefore  use  that  data  to  conduct  further 
validation  studies  in  order  to  make  our  findings  more  robust  and  higher  impact. 

We  have  already  submitted  a  grant  application  to  the  National  Institutes  of  Health  on  this  topic  (it  was  not  funded) 
(Appendix  I).  Once  our  manuscripts  above  are  submitted,  we  will  revise  that  grant  application  and  resubmit  it. 


Task  4  Identification,  recruitment,  and  retention  of  cancer  bearing  and  control  dogs. 

Among the findings from analyzing military dog records, we identified osteosarcoma, mammary cancer and mast cell
cancer in the population. Examples of the importance of pedigrees follow. (Example i) One breeding program pair with
litters within a very high-age group of 10-13 years of age (comparable to ~70 years or older in humans) was shown to
be potentially of high interest because the female had cancer (as identified by being on the AFIP/JPC pathology report list
noted above). However, the male appears to have been a “German police dog”, a fact that would be important and has to be
verified and considered in light of the experimental design (i.e., is that title a reference to origin and work application,
or is it a German Shepherd breed dog?). (Example ii) Another breeding pair with high-age litters is implicated as being
important because it had two offspring in one litter that went on to develop cancer. These examples illustrate how we can
triangulate on dogs of interest and controls by using pedigree information and AFIP/JPC pathology report data. In
addition, we are examining the parents (and their other offspring) of those breeding pairs to evaluate the possibility of
high and low cancer risk lineages.
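
The triangulation described above, combining pedigree information with the pathology report list to pair affected dogs with unaffected littermates or half siblings, can be sketched as follows. This is a hedged illustration only: the record layout (`dog_id -> (sire_id, dam_id, litter_id)`), the function name, and the identifiers in the usage note are hypothetical and are not the project's actual data model.

```python
def sibling_controls(pedigree, affected_ids):
    """For each affected dog, list unaffected littermates and half siblings.

    pedigree: dict mapping dog_id -> (sire_id, dam_id, litter_id)  (assumed layout)
    affected_ids: set of dog_ids that appear on the pathology report list
    """
    selection = {}
    for case in sorted(affected_ids):
        if case not in pedigree:
            continue  # no pedigree record on file for this dog
        sire, dam, litter = pedigree[case]
        littermates, half_siblings = [], []
        for dog, (s, d, lit) in sorted(pedigree.items()):
            if dog == case or dog in affected_ids:
                continue  # candidate controls must be unaffected
            if lit == litter:
                littermates.append(dog)
            elif (s == sire) != (d == dam):
                half_siblings.append(dog)  # shares exactly one parent
        selection[case] = {"littermates": littermates, "half_siblings": half_siblings}
    return selection
```

For example, with a made-up pedigree `{"A": ("S1", "D1", "L1"), "B": ("S1", "D1", "L1"), "C": ("S1", "D2", "L2")}` and `affected_ids = {"A"}`, dog B is returned as a littermate and dog C as a half sibling. A real selection would also need an age criterion for controls, which this sketch omits.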

In this section of the first annual report, we described how we validated the methods for genotyping of
osteosarcoma and aged non-osteosarcoma dogs (using retired racing Greyhound DNA isolated from samples acquired
through another funding mechanism and its associated IACUC-approved protocol). There we also presented a figure (#1)
of the analysis that used Principal Components Analysis to assess separation of the population (which we classified by
cancer status). We found a significant difference in the genetic makeup of the two groups and were therefore satisfied with
our selection protocol. That work is included in the Rowell et al. manuscript under revision.

In our second to last Quarterly Report (Q7), we noted that Task 2f, involving dogs at a second military base (Fort
Leonard Wood), was not necessary at that time. In the last Quarterly Report (Q8), we reported that all the dogs and
environmental aspects necessary are available from Lackland AFB. As we reported there, while the original grant application
contained a letter of support from the veterinary service at Lackland AFB, they did not disclose to us significant
information about their Military Working Dog program until the grant was funded. For that reason, we originally proposed to
sample environmental differences by using dogs outside of Lackland AFB, and identified the population of Mine Program dogs
at Fort Leonard Wood as an ideal control group exposed to a different environment than dogs at Lackland AFB. Having
Ms. Perez, our Vet Tech, embedded in the veterinary service of the military dogs at Lackland AFB has allowed us to
evaluate this environmental aspect. We have concluded that the Lackland AFB dogs have been exposed to various
locations that range from uniform (housing at Lackland AFB Medina Annex) to common or unique deployments (from
which some dogs are returned to Lackland AFB or Medina Annex temporarily or long term).


Task  5:  Molecular  Characterization  of  cancer  bearing  and  control  dogs 

Highly  detailed  description  and  discussion  of  this  was  provided  in  the  first  year  annual  report.  That  described  our 
development  of  the  appropriate  molecular  protocols  for  acquisition,  maintenance,  and  use  of  canine  samples.  In  this 
reporting  year,  we  developed  the  downstream  portion  of  this  analysis. 

We developed two new aspects of sequence analysis in both tumor-bearing and non-tumor-bearing dogs. i) We
optimized high-throughput product purification (including DNA, RNA, and PCR product) for rapid sequencing. One
time-consuming step of the laboratory workflow is working with individual samples. We have thus far optimized a protocol that
produces high-quality DNA, RNA, and PCR products at an individual sample level. However, as we begin to increase
sample number and identify variants of interest (SNP array results generated through bioinformatics analysis), it will
become necessary to validate this on a large scale of dog samples. ii) We optimized the sequencing approach to validate
SNP candidates and to genotype relatively small numbers of samples (<40). Larger numbers of samples will likely be
genotyped with custom TaqMan assays, but we have determined that sequencing is the most flexible and cost-effective
approach for smaller numbers. By sequencing both strands, we can essentially guarantee 100% accuracy (otherwise we
would refer to a nucleotide site as ambiguous or “suggestive/sequence evidence on one strand”).
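
The two-strand calling rule described above can be illustrated with a short sketch. It assumes that the reverse-strand read is reported as read on that strand (so it is complemented before comparison) and that a disagreement between strands is labeled ambiguous; the function name and the three labels are hypothetical and are not taken from the laboratory protocol.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_site(forward_base, reverse_base):
    """Classify one nucleotide site from forward- and reverse-strand reads.

    reverse_base is the base as read on the reverse strand; None (or "N")
    means no usable read on that strand. Returns (base, status).
    """
    fwd = forward_base.upper() if forward_base else None
    if fwd not in COMPLEMENT:
        fwd = None  # treat N or other symbols as no usable forward read
    # Convert the reverse read to forward-strand orientation.
    rev = COMPLEMENT.get(reverse_base.upper()) if reverse_base else None

    if fwd is not None and rev is not None:
        if fwd == rev:
            return fwd, "confirmed"  # both strands agree
        return "N", "ambiguous"      # strands disagree (assumed handling)
    if fwd is not None:
        return fwd, "suggestive"     # sequence evidence on one strand only
    if rev is not None:
        return rev, "suggestive"
    return "N", "ambiguous"
```

Here `call_site("A", "T")` is confirmed because the reverse read T complements to A, while `call_site("G", None)` is reported only as suggestive.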

In a previous study, we (Alvarez) identified the genetic variant responsible for the brindle coat pattern in dogs. In the
Rowell et al. study mentioned above, we therefore used brindle as a positive-control trait to test whether we could use the
IUT/GIA to map that known locus in our Greyhound cohort (the same that was subsequently used to map osteosarcoma
risk loci). That was a great success, as the nearest polymorphic SNP in the cohort was the top brindle-associated SNP. We
further examined the canine brindle locus as an opportunity to develop the molecular modalities planned for year 3. We
optimized the Bisulfite sequencing protocol on genomic DNA and Real Time PCR on RNA. We narrowed down the
methylation status of regions implicated for their cellular regulation roles to a single specific region of interest. We
correlated DNA methylation status and mRNA expression levels. That work seems to show a very rarely observed
mechanism whereby a long non-coding RNA in the antisense orientation regulates another gene epigenetically. As a
result, we also generated a new protocol that allows for using qPCR to test strand-specific expression levels of RNA.
That work is presently on hold, but will be followed up for publication in the future.

Lastly, we have developed approaches to conduct cross-breed analysis. This is important if related breeds share a
phenotype that is relatively rare in other breeds. If so, the most important variants associated with the phenotype are likely
to be shared in the related breeds but not the other breeds (or present at far reduced frequency). To that end, we conducted
analysis of candidate osteosarcoma risk SNPs (from Greyhounds) in the closely related breed Scottish Deerhound (using
samples acquired through other funding mechanisms and an associated IACUC-approved protocol). Because SNPs are
binary markers, single SNPs are not sufficiently informative to support haplotype sharing. We thus use canine SNP data to
identify the nearby DNA sequence with the highest density of SNPs and use that to design ~500-600 bp PCR assays that
include 2 or, ideally, more polymorphic SNPs.
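
The search for the nearby sequence with the highest SNP density can be sketched as a sliding window over known SNP coordinates. This is an illustration under assumptions: the 600 bp window, the flanking distance that defines “nearby”, and the function name are hypothetical parameters, not values from our assay design.

```python
from bisect import bisect_left, bisect_right


def densest_snp_window(snp_positions, candidate_pos, flank_bp=5000, window_bp=600):
    """Return the SNP positions in the densest window near a candidate SNP.

    Only SNPs within flank_bp of candidate_pos are considered, and each window
    spans at most window_bp bases (roughly one PCR amplicon).
    """
    positions = sorted(set(snp_positions))
    lo = bisect_left(positions, candidate_pos - flank_bp)
    hi = bisect_right(positions, candidate_pos + flank_bp)
    nearby = positions[lo:hi]

    best = []
    for i, start in enumerate(nearby):
        # Index just past the last SNP that still fits in a window starting here.
        end = bisect_right(nearby, start + window_bp - 1)
        if end - i > len(best):
            best = nearby[i:end]
    return best
```

A window is useful for haplotype analysis only if the returned list contains at least two SNPs that are polymorphic in the breeds being compared, which would have to be checked separately.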


Task  6-  Adaptation  of  existing  resources,  data  storage  and  hosting: 

We have a secure virtual machine called Research DAPER, or resdaper, developed initially by Mr. Camerlengo and
continued by his replacement Mr. Aaronson (supervised by Drs. Alvarez and Huang). The machine exists on the secure
NCHRI (Alvarez) network behind a firewall. It can only be accessed by highly secure VPN using two-factor
authentication. We have an instance of Microsoft SQL Server installed on the machine. Microsoft SQL Server is an
industry-leading relational database product that we use to store all of our documents after they have been digitized. With a
relational database, you can quickly compare information because of the arrangement of data in columns. The relational
database model takes advantage of this uniformity to build completely new tables out of required information from
existing tables. In other words, it uses the relationship of similar data to increase the speed and versatility of the database.
The "relational" part of the name comes from mathematical relations. Each table contains a column or
columns that other tables can key on to gather information from that table. We have many fields that we can filter and sort
on to retrieve items. Ultimately, this will include all clinical and associated data, environmental data and
genetic (genotype), epigenetic, and genomic/molecular (phenotype) data. The user interface is under construction. We will
have a web user interface that can be accessed by those with secure credentials. We have used Microsoft ASP.NET MVC to
build the user interface. Using the model-view-controller pattern gives us the benefit of separating the representation of
information from the user's interaction with it. The model consists of application data, business rules, logic, and functions.
A view can be any output representation of data, such as a chart or a diagram. The controller mediates input, converting it
to commands for the model or view.
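
The idea of tables keying on shared columns can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for Microsoft SQL Server, and the table names, column names, and rows are all made up for the example; they do not reflect the actual resdaper schema or any real dog records.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dog (
            dog_id     INTEGER PRIMARY KEY,
            birth_year INTEGER
        );
        CREATE TABLE pathology_report (
            report_id INTEGER PRIMARY KEY,
            dog_id    INTEGER NOT NULL REFERENCES dog(dog_id),
            diagnosis TEXT NOT NULL
        );
        """
    )
    # Illustrative rows only.
    conn.executemany("INSERT INTO dog VALUES (?, ?)", [(1, 2001), (2, 2002), (3, 2003)])
    conn.executemany(
        "INSERT INTO pathology_report VALUES (?, ?, ?)",
        [(10, 1, "osteosarcoma"), (11, 3, "mast cell tumor")],
    )
    return conn


def dogs_with_diagnosis(conn, diagnosis):
    """Join the two tables on dog_id to find dogs with a given diagnosis."""
    query = (
        "SELECT d.dog_id, d.birth_year "
        "FROM dog AS d "
        "JOIN pathology_report AS p ON p.dog_id = d.dog_id "
        "WHERE p.diagnosis = ? "
        "ORDER BY d.dog_id"
    )
    return conn.execute(query, (diagnosis,)).fetchall()
```

The `dog_id` column plays the role of the key that the pathology table uses to gather information from the dog table; the same pattern would extend to scanned forms, environmental data, and genotype tables.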

In Task 2(i) we discussed the conversion of paper health records to digital versions using ABBYY software -
mainly the 1829 form and the AFIP/JPC pathology reports. That digitized data will be fully accessible and searchable
through the web interface mentioned above. In addition, the complete veterinary clinical records scanned in Task 2(ii) will
be directly linked in PDF format. This will allow analysis of digitized data with the option of follow-up detailed analysis
of full health records (or confirmation/cross-validation of critical data) on the same database/tools ensemble, “resdaper”.
We have thus installed the ABBYY FlexiCapture software and all of its components: the Processing Server, which controls
the operation of the Processing Stations; the Licensing Server, which stores and manages licenses; and the Application
Server, which controls the operation of the other components. We also installed the Application Server components that
allow operators to connect to the server and work using a web browser, and the component that allows operators of web
stations to register with the system and create requests for access rights to the web station. The latter provides
operators of web stations with a single entry point into the system.


Task  7:  Pathway  analysis  and  functional  characterization. 
Task 7a is complete. I (Alvarez) have been conducting extensive data mining and analysis, honing the skills
that will ultimately be applied to the study of cancer in military dogs. That includes work on osteosarcoma risk
candidate genes from Greyhounds (to be published in the Rowell et al. manuscript mentioned above) and LUPA candidate
genes for multiple canine traits (also discussed above). Most importantly, the Greyhound study implicated small genomic
regions with one or two genes each. This allowed use of human cancer data and analysis servers to predict which were
likely to be cancer genes and whether the human evidence suggested the cancer risk gene variant was likely to result in up
or down regulation. For example, the IntOGen server permits analysis of gene expression and genome alterations
associated with diverse cancer types. Other analysis servers, such as NextBio, Oncomine, KMplot and BioGPS,
provide different tools to mine the same gene expression data in very different ways. For example, NextBio performs
meta-analysis of any subset of studies, and KMplot generates Kaplan-Meier survival plots for a subset of cancer types that have
very large amounts of data available. With this data in hand, it is possible to generate hypotheses and to conduct
cross-validation studies. For example, in the Greyhound osteosarcoma case, we can test those predictions by analyzing genetic
association candidates in a canine osteosarcoma tumor gene expression dataset which includes Greyhounds, Golden
Retrievers, Rottweilers and mixed breed dogs. Because there are orders of magnitude more human data than canine, it is
critical to be able to make use of it.

Among the major aspects of genetic/genomic studies are contextualization according to biochemical or genetic
pathways, cross-dimensional/platform validation, and comparative genomics/cross-species validation. To that end, I have
conducted studies in these aspects of cancer genetics. Among those, I mined for genetic evidence that the enzyme
aldehyde dehydrogenase is involved in multiple myeloma (for which there is experimental evidence generated by a
collaborator studying this with their own funding). As a result, my analyses were added to a
manuscript that was recently accepted for publication. Although the following work was not based on our military dog
data, my contributions involve the same analyses that will be conducted with canine cancer candidate genes: Yasmeen R.,
Meyers J. M., Alvarez C. E., Thomas J. L., Bonnegarde-Bernard A., Alder H., Papenfuss T. L., Benson D. M. Jr, Boyaka
P. N., Ziouzenkova O. (2013) Aldehyde dehydrogenase-1a1 induces oncogene suppressor genes in B cell populations.
Biochim Biophys Acta 1833:3218-3227. (See Appendix II) For example, I conducted the analysis shown in Figs. 6A and
6C. That critical information shows that the biology suggested by the Yasmeen et al. molecular/biochemical study can be
cross-validated by public datasets involving other types of evidence (here gene expression). Similarly, we expect that the
vast data available on human cancers will yield supporting evidence for canine cancer findings from the project that is the
subject of this report.


Task 8- Project management, Quality control and assurance, and Security:

The most important change in this reporting year is the execution of the CRADAs, which allowed us to acquire
DoD military dog data. We established a footprint at Lackland and implemented security protocols in
accordance with our agreements. We are conducting quality control evaluations of our data collection
techniques to assure that we are collecting appropriate data. Once we have assured high quality data, we will
begin automated import into the database. We are also cross-validating medical and pathology records to assure
accurate diagnosis. We initiated collaborations with Dr. David Gutman at Emory University and hope to use his
automated pathology database to facilitate confirmation of sample classification.

As of June 1st, 2013, Task 8 duties attributed to Dr. Rybaczyk (who has moved on in his academic
career, as an NIH T32 Fellow, Michigan State U.) are being done by Dr. Alvarez. This transition has been
smooth. A job listing was posted for a replacement postdoctoral fellow. Dr. Alvarez interviewed a
highly-qualified postdoctoral fellow named Dr. Sohan Lal (currently a postdoctoral fellow at Yale), but unfortunately Dr.
Lal was forced to accept another position at Yale due to imminent expiration of his visa status. There is another
candidate under consideration; the goal is to hire that person prior to initiating the biological sample collection.

The  replacement  for  Mr.  Camerlengo  -  computer  programmer  -  was  a  success.  His  role  has  been  taken 
up  by  Mr.  Jacob  Aaronson,  who  may  not  be  as  experienced  as  Mr.  Camerlengo  but  appears  to  have  greater 
affinity  for  the  biomedical  aspects  of  computational  sciences.  Mr.  Aaronson  quickly  completed  his  NCHRI 
orientation,  security  clearance/ID  badge,  and  vaccination  requirements.  Most  importantly,  he  rapidly  oriented 
himself  in  the  project  and  is  performing  high  quality  work. 


Key  Research  Accomplishments 

• Execution of institutional agreements (CRADAs) between NCHRI (Alvarez)/OSU (Huang, Couto)

• Completion of all facets of IACUC between NCHRI and Lackland AFB through final Lackland AFB oversight
approval (currently waiting for final ACURO approval expected within ~1 month)

•  Successful  embedding  of  NCHRI  (Alvarez)  Veterinary  Technician,  Ms.  Michelle  Perez  within  the  military  dog 
health  service  at  Lackland  AFB 

•  Successful  scanning  of  veterinary  clinical  records  by  Ms.  Perez  at  Lackland  AFB,  transmission  of  encrypted  data 
to  NCHRI,  and  uploading  to  DAPER  database 

• Continued development and validation of a scale-free, high-power statistical methodology capable of resolving
signal from noise in high throughput genetic/genomic data (IUT/GIA) by incorporation of bootstrapping

•  GIA  manuscripts  continue  to  be  refined  since  receiving  comments  from  peer  reviewers 

•  GIA  grant  application  to  NIH  is  being  refined  based  on  peer  reviewer  critiques 

•  Expansion  of  our  highly  flexible  data-infrastructure  that  is  robust  enough  to  handle  military  working  dog  records 
and  queries  of  said  records 

• Initiation of high-throughput software customization (ABBYY FlexiCapture) for analysis of 1829 longitudinal
veterinary records and AFIP/JPC pathological records

• Initiation of analysis of DoD military dog pathology reports to identify cancer-bearing dogs for cancer classification and
selection of cases and controls

•  Initiation  of  DoD  military  dog  “puppy  program”  pedigree  analysis  for  identification  of  high  and  low  cancer  risk 
lineages 

•  Development  of  a  collaboration  with  Dr.  Kerstin  Lindblad-Toh,  Broad  Inst.  (Harvard  U./MIT) 


Reportable  Outcomes 

•  Dr.  Jennie  Rowell,  having  received  her  PhD  from  OSU  for  her  work  at  NCHRI  (Alvarez),  joined  the  lab  of  one  of 
two  pre-eminent  dog  geneticists  in  the  world,  Elaine  Ostrander,  NIH,  as  postdoctoral  fellow.  The  first  week  of 
Nov.  2013,  she  has  a  job  interview  for  a  tenure  track  position  at  the  College  of  Nursing,  OSU 

•  Expansion  of  DAPER  database  capabilities  maintaining  strong  security 

• Dr. Rybaczyk moved on to take an NIH T32 Fellow position, Michigan State U. (from which he is expected to be
promoted to faculty once he acquires grant funding within approximately one year)

• Publication of Yasmeen et al. manuscript, which included Dr. Alvarez’s cancer datamining and analyses
that are similar to those that will be applied to the military dog cancer studies in year 3

•  Participation  by  Dr.  Alvarez  in  the  leading  international  Gordon  Conference  of  Human  Genetics  and  Genomics 
(which  requires  one  to  submit  an  application  and  be  selected  as  an  active  contributor  to  the  field)  (RI,  2013) 

• Participation by Dr. Alvarez in the 7th International Conference on Advances in Canine and Feline Genomics and
Inherited Diseases (Broad Inst., Cambridge, MA, 2013)

• Dr. Couto retired from OSU effective Sept. 1st, 2013 and created his own consulting company

•  Dr.  Alvarez  was  promoted  to  Associate  Professor  with  tenure  by  OSU 

• Dr. Alvarez was offered a Scientist position (equivalent to Associate Professor) at the Cancer Center, Sanford
Research, Sioux Falls, SD. [Discussions are ongoing, but he is happy at NCHRI/OSU and plans to stay through
the completion of this project.]

• Dr. Alvarez’s Center Chair and Division Head endorsed his application for leadership training [OSU College of
Medicine’s Center for Faculty Advancement, Mentoring and Engagement (FAME) 2014-2015 Faculty Leadership
Institute] (application pending)


Conclusion 


The project accelerated when the CRADAs were executed. In the first two years, we optimized the primary genotyping
and molecular methods, and the follow-on validation methods. We also expanded the capabilities of our highly-flexible
DAPER database and software tools in the present reporting year. In the first year we invented an entirely novel approach
to conducting genome wide genetic association (GWA) analysis - genome-wide IUT analysis (GIA); and in the second
year we further validated it. In this second reporting year, we integrated IUT and bootstrapping as an additional
innovation  with  outstanding  utility.  Dr.  Alvarez’s  presentation  of  these  methods  and  results  to  leaders  in  the  fields  of 
genetics  and  canine  genetics  resulted  in  uniformly  positive  feedback  from  them  (and  multiple  requests  for  collaboration). 
We expect to publish the two revised manuscripts on GIA (one on methods, one on empirical cancer mapping) shortly, but
the latter may be delayed while we analyze new supporting data acquired from Dr. Lindblad-Toh. In addition, we
co-authored (Alvarez) a published study that was not based on the present military dog project, but which made use of the
same  datamining  and  analysis  methods  that  will  be  used  in  our  study.  Dr.  Rowell,  one  of  our  investigators  (originally  as  a 
predoctoral  student),  moved  on  to  conduct  a  postdoctoral  fellowship  with  a  pre-eminent  dog  geneticist  at  NIH  and,  after 
only  a  year  there,  is  being  recruited  for  a  tenure  track  faculty  position  at  OSU.  Dr.  Rybaczyk,  another  of  our  investigators 
(originally  a  postdoctoral  fellow  and  promoted  to  research  scientist)  went  on  to  be  an  NIH  T32  Fellow  at  MSU,  which  is 
essentially a pre-faculty position. Dr. Alvarez was promoted to Associate Professor with tenure by OSU and is now under
consideration  for  leadership  training  in  the  OSU  College  of  Medicine. 


References 


Balding,  D.  J.  (2006).  "A  tutorial  on  statistical  methods  for  population  association  studies."  Nat  Rev  Genet  7(10):  781- 
791. 

Berger,  R.  L.  (1982).  "Multiparameter  Hypothesis  Testing  and  Acceptance  Sampling."  Technometrics  24(4):  295-300. 

Berger,  R.  L.  (1997).  Likelihood  Ratio  Tests  and  Intersection-Union  Tests.  Advances  in  statistical  decision  theory  and 
applications.  S.  Panchapakesan,  N.  Balakrishnan  and  S.  S.  Gupta.  Boston,  Birkhauser. 

Vaysse,  A.,  A.  Ratnakumar,  et  al.  (2011).  "Identification  of  Genomic  Regions  Associated  with  Phenotypic  Variation 
between Dog Breeds using Selection Mapping." PLoS Genet 7(10): e1002316.

Yasmeen R., Meyers J. M., Alvarez C. E., Thomas J. L., Bonnegarde-Bernard A., Alder H., Papenfuss T. L., Benson D.
M. Jr, Boyaka P. N., Ziouzenkova O. (2013) Aldehyde dehydrogenase-1a1 induces oncogene suppressor genes in
B cell populations. Biochim Biophys Acta 1833:3218-3227.




Appendices 


I.  Submitted  National  Institutes  of  Health  grant  application  including  Intersection  Union  Testing 
methodology  (Aims  2,  3):  Statistical  techniques  for  optimized  design  and  power  in  high-content  genomics 
(Alvarez,  PI;  Huang,  co-PI). 

II. Accepted publication: Yasmeen R., Meyers J. M., Alvarez C. E., Thomas J. L., Bonnegarde-Bernard A.,
Alder H., Papenfuss T. L., Benson D. M. Jr, Boyaka P. N., Ziouzenkova O. (2013) Aldehyde
dehydrogenase-1a1 induces oncogene suppressor genes in B cell populations. Biochim Biophys Acta
1833:3218-3227.




Descriptive Title: Statistical techniques for optimized design and power in high-content genomics

Submission  Title: 


Opportunity  ID:  PAR-09-219 

Opportunity  Title:  Exploratory  Innovations  in  Biomedical  Computational  Science  and 
Technology  (R21) 

Agency  Name:  National  Institutes  of  Health 


2.  SPECIFIC  AIMS.  This  application  is  in  response  to  PAR-09-219,  Exploratory  Innovations  in  Biomedical 
Computational Science and Technology; it addresses research, development and application of analytical and
statistical  tools  for  interpretation  of  large  biological  data  sets,  and  associated  software.  The  flood  of  biological 
data  has  highlighted  limitations  to  signal  detection.  Here  we  propose  that  combining  optimized  experimental 
design  and  novel  uses  of  statistical  methods  can  dramatically  increase  the  power  of  signal  detection.  These 
approaches  will  be  applicable  to  myriad  data  types  and  their  integration.  However,  this  proposal  will 
demonstrate  validity  using  a  highly  innovative  approach  to  complex  genetics.  We  will  conduct  a  Genome  Wide 
Association  (GWA)  study  using  high  density  genotyping  that  not  only  provides  binary  single  nucleotide 
polymorphism  (SNP)  allele  data,  but  also  total  SNP  signal  and  allele  ratios  (which  can  be  affected  by  DNA 
copy  number  variation,  CNV).  In  Preliminary  Studies  we  demonstrate  the  feasibility  of  using  allele  ratios  as 
continuous  variables  to  map  disease  loci.  This  is  the  first  such  GWA  study  of  comprehensive  CNV  information 
without  prior  classification  of  markers  as  CNV.  Our  hypothesis  is  that  implementation  of  our  algorithm  on 
multiple  (experimentally  standardized)  groups  dramatically  increases  the  power  to  detect  biological  signal. 

Experimental  design.  The  now  common  use  of  thousands  or  tens  of  thousands  of  subjects  in  genetic 
studies  can  be  attributed  to  genetic  heterogeneity/complexity  and  diverse  confounds  of  meta-analysis.  A  major 
limitation  is  the  extreme  multiple-testing  burden  in  GWA,  which  is  commonly  done  by  Chi-Square  testing  of  one 
million  markers.  In  Preliminary  Studies,  we  address  these  issues  by  1)  conducting  complex  disease  mapping 
studies  in  one  dog  breed,  which  has  100-fold  reduced  genetic  variation  compared  to  humans,  and  2)  using 
multiple,  but  experimentally  identical,  case-control  sets  or  batches.  In  this  way,  there  are  reduced  numbers  of 
disease-associated markers in a simpler background and we can apply an Intersection Union Test (IUT)
across  experiments  (in  place  of  Bonferroni  multiple-test  correction).  Computational  statistics.  The 
overarching  goal  of  the  proposed  analytical  approaches  is  based  on  the  information  theory  concept  that  the 
more  manipulations  or  corrections  are  implemented,  the  more  information  is  lost.  We  propose  here  that  this 
loss  of  information  can  be  eliminated  in  diverse  types  of  biological  data  by  integrating  two  elements.  In  the  first, 
we  use  analysis  of  covariance  (ANCOVA)  to  correct  continuous  variable  data  for  latent  known  biological 
confounders  such  as  group  membership.  In  the  second,  we  make  use  of  optimized  study  design  (specifically, 
using multiple case-control groups for a given experiment) to perform IUT. Others recently validated a similar
use of IUT independently. In Preliminary Studies, we demonstrate validation of the integrated ANCOVA and
IUT. We confirm that the use of IUT on multiple sets is a more effective solution to the three reversal
paradoxes  (Yule-Simpson,  Lord's,  and  suppression)  which  share  the  characteristic  that  the  association 
between  two  variables  can  be  reversed,  diminished,  or  enhanced  when  another  variable  is  statistically 
controlled  for.  Notably,  we  are  first  to  address  these  in  the  context  of  continuous  genomic  variables. 
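
The Yule-Simpson reversal named above can be illustrated with a small numerical sketch. The values below are invented purely for illustration (they are not data from this project), and the names `batches`, `mean`, and `case_control_difference` are hypothetical. The sketch assumes a continuous measure, such as an allele ratio, recorded in two batches with different baselines and an uneven split of cases and controls.

```python
# Hypothetical allele-ratio values (illustration only, not project data).
# Batch A has a low baseline and mostly cases; batch B has a high
# baseline and mostly controls.
batches = {
    "A": {"cases": [0.30, 0.32, 0.34], "controls": [0.26]},
    "B": {"cases": [0.60], "controls": [0.52, 0.54, 0.56]},
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def case_control_difference(cases, controls):
    """Mean of cases minus mean of controls."""
    return mean(cases) - mean(controls)


# Difference within each batch (batch membership controlled for).
within = {
    name: case_control_difference(b["cases"], b["controls"])
    for name, b in batches.items()
}

# Difference after pooling the batches (batch membership ignored).
pooled_cases = [v for b in batches.values() for v in b["cases"]]
pooled_controls = [v for b in batches.values() for v in b["controls"]]
pooled = case_control_difference(pooled_cases, pooled_controls)

for name, diff in within.items():
    print(f"batch {name}: cases - controls = {diff:+.2f}")
print(f"pooled : cases - controls = {pooled:+.2f}")
```

In this toy example the case-minus-control difference is positive within each batch (about +0.06) but negative once the batches are pooled (about -0.08), so the sign of the association depends on whether batch is controlled for. Using experimentally identical, balanced case-control batches, as proposed here, is one way to avoid this kind of imbalance by design.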

Aim  1:  Demonstrate  on  large  datasets  the  ability  of  ANCOVA  to  correctly  identify  biologically  relevant 
phenomena  that  are  linked  to  a  disease  trait.  ANCOVA  has  been  applied  to  correct  for  baseline  variables  in 
various  fields,  such  as  psychology  and  epidemiology.  Despite  similarities  in  variable  types,  data  structure,  and 
confounds,  ANCOVA  has  never  been  applied  to  large  scale  genetic  datasets.  We  will  analyze  different  types  of 
genomic  datasets  (our  own  and  from  the  public  domain)  with  well-established  population  confounds  and  show 
that  ANCOVA  is  the  most  effective  way  of  removing  those. 

Aim 2: Application of IUT for genetic analysis, allowing for multiple corrections without manipulation of
individual datasets. We propose to demonstrate the ability of IUT to detect complex genetics in a disease
phenotype and how combining IUT with ANCOVA will allow the detection of genetic determinants. The
non-obvious advancement of this method is that it incorporates information theory by minimally altering the data
before  analyzing  it.  This  retains  the  maximum  amount  of  information  for  each  measure.  It  also  does  not 
assume  linear  relationships  with  latent  variables. 

Aim 3: We will validate our claim that ANCOVA and IUT are more powerful than traditional techniques.
We will replicate a published canine complex-genetics mapping study using fewer individuals to demonstrate
that our technique is able to detect the same loci in addition to other variants missed by traditional techniques.
We  will  also  conduct  a  novel  GWA  study  of  a  human  medically  relevant  complex  trait  in  a  second  dog  breed. 


3.  RESEARCH  STRATEGY 


(a)  SIGNIFICANCE 

We  will  develop  and  implement  analytical  and  statistical  tools  (and  software)  for  interpretation  of  large 
biological  data  sets.  The  explosion  of  biological  data  has  made  prominent  several  limitations  to  signal 
detection. We demonstrate in Preliminary Studies that combining optimized experimental design and novel
application of statistical approaches can dramatically improve signal detection. These methodologies will be
applicable to analytical challenges of myriad data types and their integration, including genomics, high
throughput (HT) sequencing, population biology and genetics, and gene/organism/environment
interactions. The improvements described here address the basic concept of information theory that more
manipulation of data means more information loss. Among the areas addressed are 1) application of analysis
of covariance (ANCOVA) to correct continuous variable data for latent known biological confounders as well
as potentially avoiding the three reversal paradoxes (Yule-Simpson, Lord's, and suppression), which share the
characteristic that the association between two variables can be reversed, diminished, or enhanced when
another variable is statistically controlled for, and 2) multiple new applications of the Intersection Union
Test (IUT), including GWA, as was independently developed by another investigator very recently. This
proposal  thus  offers  solutions  and  software  to  address  critical  barriers  to  genomic  analysis,  simultaneously 
improving  scientific  knowledge  and  technical/analytical  capabilities. 

(b)  INNOVATION 

Multiple  phenotypic  traits  (such  as  height  or  weight)  are  often  treated  as  independent  from  the  effect  under 
study,  but  that  neglects  the  reality  that  many  traits  are  linked  to  other  genetic  and  environmental  modifiers. 
Others  incorporate  and  calculate  variances  based  on  environmental  or  geographic  stratifications.  However,  this 
ignores  synergism  between  the  organism,  its  immediate  surroundings,  and  the  greater  environment.  While  it  is 
not  possible  to  measure  and  analyze  every  part  of  the  environment,  some  baseline  state  must  be  identified 
from  which  deviation  can  be  measured  to  test  a  priori  hypotheses.  In  the  absence  of  this  uniform  baseline, 
almost  all  statistical  measures  will  fail  to  adequately  detect  regions  of  interest.  This  application  will 
demonstrate  feasibility  and  innovation  in  preliminary  studies  (c.5)  using  an  entirely  new  approach 
(ANCOVA/IUT) to conducting genome wide association (GWA) genetics based on continuous variable
data. An important challenge to GWA that relates to these issues above is population structure (i.e., correcting
genetic studies for non-disease-associated allele frequencies that vary in human populations). Two common
ways to address this are traditional meta-analytic techniques and IUT. But these approaches are selected
more  out  of  necessity  than  experimental  design  concerns.  The  majority  of  combinatorial  studies  have  focused 
on  publicly  available  datasets.  Each  of  the  individual  datasets  contains  differing  degrees  of  artifactual  bias  and 
other, potentially unrelated, variables. Oncomine and other algorithms applying this strategy to gene
expression have had some success, but this has not been the panacea originally predicted.

Multivariate  and  integrative  analyses  can  potentially  solve  many  issues  associated  with  genome  wide 
studies. However, they are limited by their ability to synthesize data into useful parcels of information that
are applicable clinically or to research. Integrative analysis has the benefit of alternative testing. While multiple
testing using the same measures and techniques increases error rates, alternative testing allows
measurement  of  the  same  effect  using  different  types  of  measures.  As  these  are  subjected  to  different  analytic 
techniques,  the  posterior  probability  of  false  positives  is  reduced.  Even  with  this  strength,  it  is  limited  by  biases 
and  assumptions  associated  with  individual  measures.  Ultimately  the  question  of  how  to  appropriately  identify 
genetic  contributions  independent  of  latent  confounds  has  not  been  conclusively  answered.  The  gold  standard 
for  analyses  is  univariate  testing.  While  geneticists  talk  about  penetrance  in  relation  to  populations  and 
percentages,  the  statistical  actuality  is  that  penetrance  describes  odds  ratios.  Establishing  causation  and 
deviation  from  population  norms  using  case-control,  linkage,  or  association  analyses  requires  certain 
assumptions  to  be  accepted  that  biologically  may  or  may  not  be  perilous  to  the  analysis.  While  this  is  important 
to  ethologists  and  population  geneticists,  attempting  to  compensate/account  for  these  phenomena  hinders  and 
complicates analyses. We are interested in identifying biological outcomes that are well described and are
not concerned with tangential characteristics of the effect. To this end, we sought to isolate rather than
compensate  for  effects.  When  examining  multidimensional  data  it  is  easy  to  disregard  the  interaction  of 
dimensions.  Most  dimensional  reduction  techniques  measure  and  condense  data  so  that  interdimensional 
effects  can  be  quantified.  Priming  effects  can  drastically  alter  these  techniques  and  limit  their  usefulness.  For 
this reason we applied ANCOVA to remove independent effects from dependent effects prior to dimensional
reduction.  Here  we  show  adjusted  and  un-adjusted  measures  to  illustrate  how  the  application  of  ANCOVA  prior 
to  traditional  techniques  is  capable  of  increasing  the  sensitivity  of  a  study,  as  well  as  the  potential  to  correct  for 
the  reversal  paradoxes  (c.5.  P.S.,  Study  Design)  by  comparison  to  traditional  normalization  techniques. 

(c)  APPROACH 

C.1.  Research  team.  The  multidisciplinary  team  is  ideally  suited  for  this  project.  Dr.  Alvarez  (PI)  is  PI  in 
Molecular  and  Human  Genetics,  Nationwide  Children’s  Hospital  Research  Institute,  with  a  tenure  track 
academic  appointment  at  The  Ohio  State  University  College  of  Medicine.  He  has  extensive  expertise  in 
molecular  and  human  genetics  and  genomics,  bioinformatics,  and,  from  management  level  industry  experience 
(Novartis  Research),  the  discovery  and  validation  of  new  drug  targets  and  biomarkers.  Dr.  Leszek  Rybaczyk 
(Research  Scientist,  Alvarez  Lab)  is  expert  in  statistical  bioinformatics.  Dr.  Huang  Kun  (Co-I)  is  co-director  of 
the  OSU-CCC  Biomedical  Informatics  Shared  Resource.  His  research  is  focused  on  developing  bioinformatics 
tools  for  systems  biology  and  research.  Here  he  will  be  responsible  for  developing  and  implementing  the 
software package. The advanced statistics expertise will come from a long term collaborator of the three
investigators named above, Dr. Pramod K. Pathak (consultant, MSU). He is a theoretical and applied
statistician  with  specific  interests  in  statistical  methods  and  their  applications  to  biomedical  research,  sampling 
and  resampling  methods,  computational  statistics,  reliability,  and  optimization  problems  in  statistics. 

C.2.  Research  strategy  (RS).  Note:  As  the  approach  has  statistical  components  addressing  different  biology, 
we  will  explain  the  approach  once,  in  Research  Strategy,  and  establish  feasibility  in  Preliminary  Studies. 

RS  Aim  1.  We  propose  to  address  these  gaps  by  applying  statistically  proven  methodologies  in  novel  ways. 
ANCOVA  has  been  applied  in  various  fields  such  as  psychology  and  epidemiology  to  correct  for 
baseline variables. Despite the similarities in variable types, data structure, and problems with confounds,
ANCOVA has never been applied to large scale genetic datasets. Aim 1: Demonstrate on a large dataset the
ability  of  ANCOVA  to  correctly  identify  biologically  relevant  phenomena  that  are  linked  to  a  disease  trait.  The 
rationale  and  technical  approach  for  this  aim  are  well  elaborated  in  c.5.  Preliminary  Studies.  Canine  genetic 
data similar to those generated in Preliminary Studies will be generated from 1) 36 Scottish Deerhounds: 18
osteosarcoma cases and 18 controls (i.e., three case-control batches of six and six), as well as 2) 36
Doberman Pinschers: 18 with cervical spondylomyelopathy and 18 controls (i.e., three case-control batches of six and
six). In addition, we will analyze diverse genomic datasets from the public domain (including human SNP
GWA,  gene  expression,  and  HT-sequencing).  For  example,  by  using  TCGA  data,  in  which  the  same  patient’s 
tissue  was  assayed  on  different  microarrays  in  different  laboratories,  using  an  ANCOVA  approach  we  will 
identify  the  most  biologically  relevant  factors.  We  will  expand  that  by  looking  not  only  at  the  cancer  type,  but 
also  at  the  laboratory  where  the  tissue  was  processed;  the  date  on  which  it  was  processed,  etc.,  and 
identify/potentially remove such intrinsic errors. Power analysis. Based on our ongoing genetic studies (see
Preliminary Studies), we assumed that restricting to potentially relevant SNPs will reduce the total of 173,000 SNPs to 1700
[MD Anderson Bioinformatics server with power of 0.8, acceptable false positives of 1, SD of 0.7]. With the
sample size of 36 dogs in each breed (18 cases and 18 controls) we will have 80% power to detect 2-fold differences
in B allele frequency between cases and controls for candidate SNPs of interest (per SNP alpha = 0.00059).
This is conservative, as ANCOVA and IUT would only reduce the variance.
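
The per-SNP alpha quoted above appears consistent with allowing one expected false positive among the 1700 candidate SNPs; this reading is an assumption, since the server settings are only summarized here. A short sketch of that arithmetic follows, with a Bonferroni threshold for the full array shown for comparison. All names are illustrative.

```python
# Sketch of the threshold arithmetic; names are illustrative only.
TOTAL_SNPS = 173_000            # markers on the array (from the text)
CANDIDATE_SNPS = 1_700          # potentially relevant SNPs (from the text)
ACCEPTABLE_FALSE_POSITIVES = 1  # setting quoted in the text


def per_test_alpha(expected_false_positives, n_tests):
    """Per-test alpha giving the stated expected number of false
    positives when every tested marker is truly null."""
    return expected_false_positives / n_tests


alpha_candidates = per_test_alpha(ACCEPTABLE_FALSE_POSITIVES, CANDIDATE_SNPS)

# Family-wise 0.05 Bonferroni threshold over the full array, for comparison.
alpha_bonferroni_full = 0.05 / TOTAL_SNPS

print(f"per-SNP alpha, 1700 candidates: {alpha_candidates:.5f}")
print(f"Bonferroni alpha, full array:   {alpha_bonferroni_full:.2e}")
```

Under this assumption the candidate-set threshold rounds to 0.00059, roughly three orders of magnitude less stringent than a Bonferroni correction over all 173,000 markers.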

RS Aim 1 Potential pitfalls and contingencies. (1) A limitation to using the integrated ANCOVA/IUT
on biological data is that it is only applicable for continuous variable data. While this excludes, say,
conventional binary-genotype GWA analysis, we address this need with the development of an IUT-alone
approach; this use is now validated by us (see c.2. RS Aim 3 Expected results. Example 1) and by a second
independent group. Moreover, much genetic data (e.g., array CGH, HT-sequencing) and most genomic data
have continuous variables (microarray and HT-sequencing based RNA expression and epigenetics, proteomics,
metabolomics, etc.). (2) Another potential concern is the need for clear understanding of appropriate data
structure. For that reason, we chose to make this proposal not only about the statistical methods, but also
about experimental design. We will make a major effort to document the proper use of these algorithms in
publications and software Help documentation. (3) Lastly, these methods are computationally intensive. This
will not affect us, as Dr. Huang (Co-I) is Director of Bioinformatics and has access to the OSU Supercomputer
Center.  Despite  the  computational  demands,  the  methods  proposed  here  offer  analytical  abilities  that  are 
unique  and  state  of  the  art,  and  are  sure  to  gain  wide  use.  We  believe  that  our  optimization  studies  and  careful 
statistical/software  instructions  will  facilitate  the  most  efficient  implementation  of  our  algorithms. 

RS Aim 2. A second statistical technique, the Intersection Union Test, has been gaining use in the
genomics field. The IUT increases power, but also increases type I error as the number of comparisons
increases. However, because of the many latent confounds that cannot be accounted for in most genomic
work, the IUT is the most elegant solution to reducing these errors. For instance, in large datasets where a
multitude of tests are conducted under traditional techniques, a multi-testing correction would need to be
applied. However, as we previously demonstrated using the IUT, the probability of any specific false positive
decreases exponentially with the addition of new datasets. This is because the probability of detecting the
same false positive in several independent datasets is the product of the per-dataset α levels, traditionally 0.05 each. For two datasets the
probability of the same false positive being detected is 0.0025, for three it is 0.000125, and so on. This can
compensate for even large datasets. In datasets with 173,000 variables (SNP arrays used in preliminary
studies), using between 4 and 6 independent datasets would be expected to eliminate nearly all false positives. Conversely, if the
same signal is being detected in 6 datasets the probability that it is due to chance is of the order 1.5 x 10^-8. Aim
2: IUT is a powerful new tool for genetic analysis and allows for multiple corrections without manipulation of
individual datasets. We propose to demonstrate the ability of IUT to detect complex genetics in a disease
phenotype and how combining IUT with ANCOVA will allow the detection of genetic determinants and potentially
explain penetrance. The non-obvious advancement of this method is that it incorporates information theory by
minimally altering the data before analyzing it. This retains the maximum amount of information for each
measure. The IUT is also not hampered by many of the assumptions of other tests.
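
The arithmetic in this paragraph can be sketched as follows. This is a minimal illustration, assuming the datasets are independent and every test is run at the same per-dataset level; the function names are hypothetical and do not correspond to the project's software. The decision rule shown (declare a marker significant only if it is significant in every dataset, which is equivalent to comparing the largest p-value with alpha) is the intersection-union rule described by Berger (1982).

```python
def iut_reject(p_values, alpha=0.05):
    """Intersection-union rule: reject only if every dataset rejects,
    i.e. the largest p-value is at or below alpha."""
    return max(p_values) <= alpha


def joint_false_positive_probability(alpha, n_datasets):
    """Chance that one null marker passes in all n independent datasets."""
    return alpha ** n_datasets


def expected_false_positives(n_markers, alpha, n_datasets):
    """Expected number of null markers passing in every dataset."""
    return n_markers * joint_false_positive_probability(alpha, n_datasets)


N_MARKERS = 173_000  # SNP array size quoted in the text

for k in range(1, 7):
    p = joint_false_positive_probability(0.05, k)
    e = expected_false_positives(N_MARKERS, 0.05, k)
    print(f"{k} dataset(s): P(same false positive) = {p:.3g}, "
          f"expected false positives = {e:.3g}")

# Hypothetical p-values for one marker across three case-control batches.
print(iut_reject([0.01, 0.03, 0.04]))    # True: significant in all three
print(iut_reject([0.001, 0.002, 0.20]))  # False: fails in one batch
```

Under these assumptions the expected number of false positives across 173,000 markers falls to roughly one with four datasets and well below one with five or six, which is in line with the range quoted above.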

RS Aim 2 Potential pitfalls and contingencies. The IUT is dependent on having a common variable
across all data sets used in the analysis. This variable can be very broad such as dog breed or very narrow
such as a molecular phenotype. Regardless, the IUT will only answer questions related to the common
variable among data sets. One way to address that is in the initial study design. The study design should
take into account all of the limitations associated with the various statistical tests a priori. As we recently
discussed in a publication, applying the IUT to unrelated data sets will result in the elimination of all signal.

RS Aim 3 rationale. Large scale studies that use traditional GWA require large patient populations to
achieve adequate power (and have yet to explain a significant portion of the heritability associated with most
diseases). This has serious pragmatic and ethical implications. It also poses several experimental design
problems, as independent irrelevant variables (e.g., in genetics, population structure) can overpower the effect
of interest. Manipulation of data by Principal Component Analysis (PCA) after merging, or applying
normalizations, hinges on the assumption that the interactions are linear. If the interactions are non-linear,
applying these corrections can make analysis more difficult. Aim 3: We propose to demonstrate that
ANCOVA and IUT are more powerful than the traditional techniques by identifying a study and replicating that
study using fewer patients and demonstrating that our technique is able to detect the same signal in addition
to other variants missed by the more traditional techniques.

RS Aim 3 Genetic studies experimental plan. As we did in Preliminary Studies (c.5., using the same
Illumina 173,000 SNP array), we will conduct GWA analysis of two complex traits, each with high incidence in
a dog breed. Mapping (1) As validation of a complex trait that has been mapped using a conventional genetic
approach and published, we will map osteosarcoma in Scottish Deerhounds (one locus of dominant effect with
evidence of linkage (Zmax=5.766)). The original work used a 4-generation pedigree where 60 Deerhounds
were genotyped and the genotypes of 70 others were inferred, for a total of 130 dogs. We will replicate that
study using the methods developed in this proposal to conduct GWA (ANCOVA/IUT on B allele frequency data
and IUT on allele/genotype data) on 18 Deerhound cases and 18 controls (i.e., three case-control batches of
six and six). Mapping (2) In order to immediately draw high impact attention to our innovative approaches, we
propose to conduct GWA of a prominent breed-specific complex-genetic condition with high human relevance:
“wobblers”, or cervical spondylomyelopathy, in Doberman Pinschers (reported to explain 2.5% of proportional
mortality in the breed). We have been collaborating for over a year with Ronaldo da Costa, our OSU
colleague who is a leading authority in this. We are currently conducting pedigree analysis on ~1000
Dobermans (showing strong evidence of heritability; data not shown), and have initiated collection of
blood/DNA samples. Using the Doberman wobblers pedigree, we will select optimal informative dogs to
conduct a mapping study with 18 cases and 18 controls (i.e., three case-control batches of six and six). Power
analysis. See c.2. RS Aim 1, end of first paragraph. Follow up to broad mapping: depending on the
type/strength  of  the  evidence  and  the  length  of  the  haplotypes,  we  will  conduct  either  fine  mapping  in  related 
breeds  that  share  a  similar  phenotype,  sequence  implicated  haplotypes  using  sequence  capture,  or 
characterize  transposition  events,  structural  variation  or  DNA  methylation  status  (see  PI  (Alvarez)  biosketch, 
which demonstrates successful funding of grants in this area from NIH, DoD CDMRP and AKC-CHF). The PI is
an expert in the genomics, sequence and evolutionary biology analyses that will be required to fully evaluate
genetic variants and their possible disease effects.

RS  Aim  3  Expected  results.  We  predict  that  in  Mapping  (1)  we  will  identify  the  same  locus  published 
previously  (leading  to  refining  the  locus  through  recombination  in  both  breeds),  and  that  we  will  identify  other 
loci  associated  with  osteosarcoma  risk  -  both  SNP  alleles  and  B  allele  frequency  changes  suggestive  of  CNV 
or of effects resulting in allele-specific SNP genotyping bias from the amplification step [^®]. As Deerhounds are
relatively  closely  related  to  Greyhounds,  we  also  expect  to  find  some  loci  shared  between  the  two,  which  would 
provide  convincing  replication  of  the  findings  in  our  preliminary  studies.  We  predict  that  in  Mapping  (2)  we  will 
find  wobblers-associated  variants.  For  both  mapping  studies  we  expect  to  identify  loci  that  could  not  have  been 
found using conventional genetic analyses. Example 1: in preliminary GWA studies applying IUT to binary
genotype calling of the same Illumina SNP array data used in c.5. Preliminary Studies, we identified a genome
wide  significant  locus  that  would  not  have  been  identified  by  conventional  Chi-Square  GWA  analysis  (not 
shown).  Strikingly,  two  of  the  three  case-control  groups  had  increased  frequency  of  the  SNP  allele  associated 
with  high  risk,  but  the  third  group  had  reduced  frequency  of  the  same  allele  associated  with  reduced  risk.  We 
propose that, due to reversal paradox effects, many such findings cannot be detected by conventional
GWA. We also expect to identify candidate genes (e.g., some osteosarcoma candidate haplotypes have no
more than one gene) and variants (e.g., through sequence capture) within association loci. Example 2: in
Preliminary Studies we demonstrate the use of ANCOVA/IUT to identify continuous variable differences in B
allele  frequencies  associated  with  osteosarcoma  risk.  This  would  not  be  possible  with  current  approaches  that 
map  binary  SNP  alleles  (and  cannot  be  detected  indirectly  by  tag-SNPs  in  LD  when  the  variants  are  relatively 
recent).  Such  variation  may  be  indicative  of  genetic  effects  never  before  sampled  genome  wide  for  GWA,  such 
as CNV or isothermal amplification bias [^®] in Illumina Infinium SNP genotyping (e.g., due to DNA methylation,
structural variation, and retrotransposition events). If our expected results materialize, as is strongly supported
by our preliminary studies, they would establish the superior power and preservation of information in the
innovative experimental design and analyses we propose, and they would open the door to studying the most
common types of genetic variation (and those with the highest mutation rates) for the first time.
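
The intersection-union logic behind Example 1 can be illustrated with a short sketch. It is not the analysis code used for the preliminary studies: the per-batch test (a two-sided Fisher exact test on allele counts), the allele counts, and all function and variable names are assumptions chosen for illustration. An IUT declares a locus significant only when every batch rejects its own null hypothesis, so its p-value is the largest per-batch p-value. A locus whose effect direction reverses in one batch can therefore pass the IUT while the pooled table looks unremarkable.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    tail = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, tail)

def iut_p_value(batch_tables):
    """IUT p-value: the largest of the per-batch p-values."""
    return max(fisher_exact_two_sided(*table) for table in batch_tables)

# Hypothetical allele counts per batch of 6 cases and 6 controls (12 alleles each):
# (case B alleles, case A alleles, control B alleles, control A alleles).
batches = [
    (11, 1, 2, 10),   # B allele enriched in cases
    (10, 2, 1, 11),   # B allele enriched in cases
    (1, 11, 11, 1),   # direction reversed in the third batch
]
pooled = tuple(sum(column) for column in zip(*batches))

iut_p = iut_p_value(batches)
pooled_p = fisher_exact_two_sided(*pooled)
print(f"IUT p = {iut_p:.2g}, pooled p = {pooled_p:.2g}")
```

With these made-up counts the pooled table is (22, 14, 14, 22), so the reversal in the third batch largely cancels the signal of the other two when the batches are combined. A genome-wide analysis would still need a significance threshold appropriate to the number of SNPs tested.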

RS Aim 3 Potential pitfalls and contingencies. Our preliminary studies support the feasibility of
applying very well-established statistical methods to novel biological data analyses. For example, applying an
IUT approach to GWA using binary genotype data identified a SNP locus at genome wide significance, but no
locus reached significance using conventional Chi-Square analysis on the same genotype data (see Example
1 in previous section). Notably, others have recently independently validated that same application of IUT.^^ A
second example is the fact that the ANCOVA/IUT mapping approach identified several loci that were covered
by multiple significant SNPs, including five SNPs in a 600 kb region of chr6; the odds of the observed
physical genome distribution being a random effect are infinitesimally low. The greatest challenges in the field
of GWA are validation of association and identification of causative mutations. These remain potential pitfalls
for us, but we are encouraged by the fact that our osteosarcoma GWA (using IUT of conventional binary
genotypes) in Greyhounds identified one (of 19 significant) SNPs within the 4.5 Mb interval identified for
linkage to osteosarcoma in the closely related Scottish Deerhound. This ability to fine map across related
breeds is one of the major strengths of dogs, as are the reduced phenotypic and genetic heterogeneity.'^” For
the mutation detection, we will be challenged, as is everyone, but 1) we have improved chances over most
others because we will have more loci to prioritize for specific molecular approaches based on our types of
findings (say, structural variation vs. DNA methylation), and 2) we have the technical and computational
expertise, and are using the most cutting edge methodologies.

C.4.  Software  development 

All the algorithms developed in this project will be integrated into an open source R package using R and
Bioconductor functions and packages. The package will be tested on both a stand-alone workstation and
parallel computing environments, including two clusters available at OSU (one in the Ohio Supercomputer
Center, one in the Dept. of Biomedical Informatics). The package will be released on a project website and
made freely available to the public. In addition, we will submit it to Bioconductor in compliance with the testing and
inclusion criteria. If time permits, we will also consider integrating the R package into a web tool using web
interface tools such as the Rcgi package (a CGI WWW interface to R).

C.5.  Preliminary  studies  &  Demonstration  of  proposed  experimental  approach 

Note: To demonstrate the novelty and significance, and the experimental plan for all three Aims, we devote
significant space in this proposal to describing our preliminary studies (two manuscripts in preparation).

Study design (ANCOVA/IUT approach), canine osteosarcoma (OSA). Dog breeds have ~100-fold
less genetic variation than humans. Greyhounds were split over one hundred years ago into racing and show
sub-breeds (registered NGA and AKC, respectively). Strikingly, racers have the highest OSA rate (25%
incidence) of any breed, whereas show dogs have no increased risk.'^^''^^ We thus designed a study of a
complex genetic trait in an outbred mammal, but used one of the simplest such contexts possible. Genotyping
of these dogs was performed using the highest density SNP array available in dogs (Illumina HD, 173,000
features; fewer SNPs than in humans due to the highly extended linkage disequilibrium (LD) in dogs).
Importantly, this genotyping platform provides not only the presence or absence of the binary A or B alleles at
each marker, but also the signal intensity of the marker and the ratio of the two alleles (referred to as B allele
frequency, BAF). We conducted the SNP genotyping in three OSA positive-negative (case-control) groups in
order to 1) use ANCOVA to adjust for group membership and potentially address the three
reversal paradoxes (Yule-Simpson, Lord's, and suppression), which share the characteristic that the
association between two variables can be reversed, diminished, or enhanced when another variable is
statistically controlled for [”'^°]; and 2) enable the use of IUT in place of GWA by Chi-Square analysis with
Bonferroni multiple testing correction. Specifically, we genotyped batches of 12 dogs in the combination of 4
OSA racers, 4 OSA free racers (OFR) and 4 show
(AKC). Statistics & Results: Data were analyzed using Illumina GS and Partek GS. Sample attributes (incl.
racing/show and disease status) were used to assign animals to conditions for ANCOVA corrections. ANCOVA
is based on regressions and, when used as a statistical test, assumes that covariates are
independent variables. In our ANCOVA procedure we used it to establish
weighted averages so that groups that are biologically similar have the
same regression slope. Linear models in biological contexts have been
heavily criticized. In this procedure a linear model is entirely appropriate
since we are classifying based on known biological traits. Although this
does render the measures arbitrary, it allows for effects to be isolated that
can be subjected post hoc to other tests. Figure 1 demonstrates the
effects of ANCOVA isolation on principal components associated with
the phenotype of interest. Before correction, two low risk groups (AKC
and OFR) fail to cluster according to risk due to population structure.

Fig 1. Application of ANCOVA. Correction of Greyhound osteosarcoma (OSA) positive and
negative continuous variable genotypes (B allele frequencies). (A) Uncorrected analysis shows
population structure effects: separating OSA positive and negative racers apart from negative AKC show
Greyhounds. (B) ANCOVA-corrected analysis cleanly separates OSA positive and negative dogs.

Table 1. Analysis of informative SNPs using

Regression  lines  were  computed  for  the  appropriate  factors  and 
interaction  values  were  transformed  and  weighted  to  correct  for  the  slope 
of  the  generalized  linear  model.  We  next  calculated  the  covariance 
matrix of the loading values for each dataset and conducted IUT using a
threshold  of  ±0.6.  Many  publications  have  reported  that  Pearson 
correlation  (r)  values  of  0.4  are  biologically  significant.  Here  we  used  0.6 
assuming  it  most  likely  captures  the  most  informative  SNPs. 
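
The effect of covariate adjustment on a correlation threshold can be shown with a deliberately simplified sketch. It is an illustration only and not the procedure described above: the adjustment is reduced to removing each sub-breed's mean B allele frequency (the simplest case of regressing on a categorical covariate), the statistic is a plain Pearson correlation between BAF and disease status rather than a covariance matrix of loading values, and the BAF values, variable names, and helper functions are all hypothetical. Only the 0.6 cutoff is taken from the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def center_by_group(values, groups):
    """Subtract each group's mean from its members (adjusts for a categorical covariate)."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return [value - totals[group] / counts[group] for value, group in zip(values, groups)]

# One hypothetical batch of 12 dogs at a single SNP:
# 4 OSA racers, 4 OSA-free racers (OFR), 4 AKC show dogs.
baf = [0.80, 0.82, 0.78, 0.80,   # OSA racers
       0.50, 0.52, 0.48, 0.50,   # OFR
       0.90, 0.92, 0.88, 0.90]   # AKC show dogs (shifted by population structure)
sub_breed = ["racer"] * 8 + ["show"] * 4
osa = [1] * 4 + [0] * 8          # 1 = osteosarcoma, 0 = unaffected

THRESHOLD = 0.6                  # cutoff quoted in the text
r_raw = pearson_r(baf, osa)
r_adjusted = pearson_r(center_by_group(baf, sub_breed), osa)
print(f"raw r = {r_raw:.2f}, adjusted r = {r_adjusted:.2f}")
```

In this toy batch the show dogs carry a high BAF for reasons unrelated to disease, which masks the case-control difference among racers until the sub-breed offset is removed.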

A list of potential candidate SNPs from the ANCOVA/IUT was
identified and used to filter genotype information. Genotypes were subjected to a Chi-Square test of
association for osteosarcoma risk. Non-significant genotypes were eliminated from the analysis. Once only
SNPs that are loaded with the most meaningful measures remained, we conducted t-tests to determine if they
were capable of discriminating between the two training populations. This procedure revealed that the
osteosarcoma free racers and the AKC show greyhounds, which have a below average incidence rate, clustered
together and the first principal component explained the osteosarcoma risk variability initially masked by the
effects of the population difference (Fig. 1B). We then went on to determine whether it was a genotypic effect
such as haplotypes or if some other mechanism was associated with the differential risk in these two
populations. Intriguingly, regions associated with altered risk could not be identified based on haplotypes
alone. However, the signal was derived from alterations in B allele frequency that correctly categorized dogs
across unrelated datasets. The genome wide significant hits are shown in Table 1. Encouragingly, several
regions are detected by multiple SNPs (colored), including five SNPs in a 600 kb region of chromosome 6.

Preliminary  studies  conclusions.  Here  we  presented  the  first  GWA  study  of  osteosarcoma  in  any 
organism,  and  reported  approximately  twenty  hits.  Our  approach  showed  how  population  structure  can  affect 
the  ability  to  detect  biologically  relevant  genetic  effects.  In  addition,  this  is  the  first  work  to  detect  genome  wide 
significant association signal using continuous variable genotype data (B allele ratios) and ANCOVA/IUT; we
propose  those  loci  are  a  combination  of  CNVs  and  genetic/epigenetic  variants  with  differing  amplification  bias 
[^®]  in  the  SNP  genotyping  protocol.  This  is  consistent  with  Dr.  Nadeau’s  suggestion  that  the  missing  heritability 
may  lie  in  unexplored  genome  regions  or  “in  largely  untested  classes  of  genetic  variation.”''^  Beyond  the 
analysis shown here, we conducted a second GWA analysis of the same data, but applying only IUT using
binary allele calls (see c.2., RS Aim 3, Expected results and Potential pitfalls and contingencies). That analysis
suggested  validation  of  the  study,  as  one  of  19  genome  wide  significant  hits  is  within  the  4.5  Mb  interval  linked 
to  osteosarcoma  in  Deerhounds.  Moreover,  we  identified  SNPs  that  could  not  be  identified  by  conventional 
approaches  due  to  the  reversal  paradoxes. 


ANOVA for multiple categories of risk.
Application summary: We propose to develop novel applications of validated statistical approaches to enable
greatly improved analysis of continuous-variable biological data. This and the new applications of IUT will be
widely used for genomic and integrative analyses.

The deregulation of B cell differentiation has been shown to contribute to autoimmune disorders, hematological
cancers, and aging. We provide evidence that the retinoic acid-producing enzyme aldehyde dehydrogenase 1a1
(Aldh1a1) is an oncogene suppressor in specific splenic IgG1+/CD19− and IgG1+/CD19+ B cell populations.
Aldh1a1 regulated transcription factors during B cell differentiation in a sequential manner: 1) retinoic acid
receptor alpha (Rara) in IgG1+/CD19− and 2) zinc finger protein Zfp423 and peroxisome proliferator-activated
receptor gamma (Pparg) in IgG1+/CD19+ splenocytes. In Aldh1a1−/− mice, splenic IgG1+/CD19− and IgG1+/CD19+
B cells acquired expression of proto-oncogenic genes c-Fos, c-Jun, and Hoxa10 that resulted in splenomegaly.
Human multiple myeloma B cell lines also lack Aldh1a1 expression; however, ectopic Aldh1a1 expression rescued
Rara and Znf423 expressions in these cells. Our data highlight a mechanism by which an enzyme involved in
vitamin A metabolism can improve B cell resistance to oncogenesis.

© 2013 Elsevier B.V. All rights reserved.


1.  Introduction 

The  deregulation  of  B  cell  differentiation  has  been  shown  to  play  a 
causal  role  in  autoimmune  disorders,  carcinogenesis,  and  aging  [1].  B 
cells  differentiate  from  B-lymphoid  progenitors  in  the  bone  marrow 
and  progress  through  many  stages  during  differentiation  to  ultimately 
express  B  cell  receptor  (BCR).  Lymphocytes  expressing  surface  IgM 
migrate  to  the  spleen  [2],  where  self-reactive  splenic  B  cells  undergo 
apoptosis;  others  become  responsive  to  T-cell-dependent  and  T-cell- 
independent antigens. The various B-cell populations are compartmentalized
in different splenic zones, including red pulp, marginal zone, and
white pulp. After pathogen exposure, they complete differentiation in
germinal centers [2]. Specific populations of B cells can undergo alternative
differentiation. For instance, purified mouse splenic B cells respond
to stimulation with cytokines, anti-CD38, anti-CD40, anti-μ, and retinoic
acid (RA) by enriching IgG1+ and CD138+ B cell populations and genes
involved in the regulation of Ig somatic hypermutation and class
switching [3]. In these B cells, RA contributed to the suppression of
activation-induced deaminase (Aid), transcriptional regulators of differentiation
(Pax5), and neoplastic transformation t(9;14) [3,4]. Oncogenic
processes further diversity in B lymphoma cells [5]. The physiological
mechanisms responsible for the formation of specific B cell subsets
have remained unexplored.

Abbreviations: Aldh1a1, aldehyde dehydrogenase 1a1; AP1, activator protein 1, a
heterodimeric transcription factor formed by c-Jun and c-Fos; c-Fos, transcription factor
encoded by the FOS gene; c-Jun, protein encoded by c-Jun gene; Hoxa10, transcription factor
homeobox protein a10; Ig, immunoglobulin; Pparg, peroxisome proliferator-activated
receptor gamma; RA, retinoic acid; Rara, retinoic acid receptor alpha; RARE, retinoic acid receptor
response element; Zfp423, murine zinc finger protein; Znf423, human zinc finger protein

The studies with dietary vitamin A (retinol or retinyl esters)
highlighted a possible role for this pathway in specific B cell responses.
Dietary vitamin A content influenced IgA production against T-cell dependent
and T-cell independent type 2 antigens at mucosal locations
[6]. Vitamin A deficiency in the diet diminishes immune responses
and increases mortality [7-10]. These responses may be partially improved
by supplementation with either vitamin A or its metabolite RA,
arguing for RA as a mediator of these responses. Studies of RA function
in B cells in vitro revealed that multiple aspects of B cell biology
are RA-sensitive [10,11]. RA accelerated differentiation of a subset of
proliferating lymphoid progenitor cells into B cells by targeting the oncogenes
c-myc and cyclin D3, cytokines, and NFκB, as well as kinase
p38/CDK2 [12,13]. RA treatment also promoted differentiation of malignant
B cells, alone or in combination with rosiglitazone, an agonist for
the nuclear receptor PPARγ [14]. The understanding of RA in immune
function is incomplete. For example, a recent trial in Guinea-Bissau
revealed a paradoxically higher mortality in girls supplemented with
vitamin A than in the placebo group [15]. In this study we dissected the
role of endogenous vitamin A metabolism on gene regulation in B cells.

An increase in the intracellular RA concentrations is generated in response
to various hormonal, dietary, and inflammatory stimuli and may
be considered as a factor in endogenous differentiation of B cell subsets
[11,16]. RA is produced sequentially. Alcohol dehydrogenases (ADH and
SDR/RDH) oxidize retinol to retinaldehyde, which is dehydrogenated
into RA by the members of the aldehyde dehydrogenase-1 family of enzymes:
ALDH1a1, ALDH1a2, and ALDH1a3 [17]. The principal mechanism
of RA action is through the activation of RA receptors (RAR). RA
binding to RAR induces its heterodimerization with retinoid X receptor,
and binding to cognate response element (RARE) sequences in the promoters
of target genes [18]. In addition, RA regulates a plethora of signaling
and transcriptional pathways [19], including a Zfp423-dependent
induction of Pparg expression [20]. Paracrine RA production by ALDH1
enzymes in dendritic cells plays an important role in B cell homing to
the mucosa, and promotes IgA isotype class switching [21,22]. The role
of RA-generating enzymes in B cells has remained unexplored. Here,
we investigated the transcriptional function of ALDH1a1 in murine B
cell subsets and in human multiple myeloma B cell lines.


2.  Materials  and  methods 

2.1.  Reagents 

We  purchased  reagents  from  Sigma-Aldrich  (St.  Louis,  MO)  and  cell 
culture  media  from  Invitrogen  (Carlsbad,  CA)  unless  otherwise  indicated. 
Anti-mouse antibodies were: CD19 from BD Biosciences (San Jose, CA)
and β-galactosidase from Abcam (Cambridge, MA).


2.2.  Animal  studies 

All experimental protocols were approved by the Institutional
Animal Care and Use Committee. Water and regular chow (Harlan
Laboratories, Indianapolis, IN) were available ad libitum in all mouse
studies.

Study 1 employed Tg RARE-Hspa1b/lacZ (denoted as RARE-lacZ) reporter
mice developed by Dr. J. Rossant using a transgenic construct
containing 3 copies of the 32 bp RARE placed upstream of the mouse
heat shock protein 1B promoter and β-galactosidase gene (lacZ) [23].
Female  mice  were  purchased  from  the  Jackson  Laboratory  (Bar  Harbor, 
ME).  Three  RARE-lacZ  and  three  wild-type  C57BL/6J  (WT)  female  mice 
(12-15  weeks  old)  were  fed  regular  chow  throughout  this  study. 

Study 2: Aldh1a1−/− mice were previously generated in the laboratory
of G. Duester [24] and characterized for their metabolic responses
[20,25,26]. Aldh1a1−/− (n = 10) and WT (n = 9) 13-14 month old
male and female mice were used for these studies. Mice were fed a regular
chow diet. Blood was collected by cardiac puncture in EDTA-
containing tubes. The spleens isolated from 3 randomly-selected females
were used for IgG1+/CD19− and IgG1+/CD19+ B-cell separation.
The remaining spleens and other organs were used for other analyses.


2.3.  Human  cells 

Leukopacks  (American  Red  Cross,  Columbus,  OH)  were  obtained 
from  healthy  donors  under  an  Institutional  Review  Board-approved 
procurement  protocol.  Peripheral  blood  mononuclear  cells  (PBMC) 
were cultured in RPMI 1640 media (Invitrogen) supplemented with
10% fetal bovine serum (ICN Biomedicals, Irvine, CA) and kept at 37 °C
in a 5% CO2/air incubator.


2.4.  Flow  cytometry  analysis  (FACS) 

Splenocyte suspension was obtained from whole spleens dissected
from 4 WT and 4 Aldh1a1−/− mice (2 males and 2 females in each
group). Briefly, spleens were collected and mononuclear cell suspensions
were prepared by mechanical disruption with the aid of a cell
strainer (BD Biosciences, San Jose, CA) followed by brief incubation in
NH4Cl (0.08%) to remove red blood cells. Splenocytes were resuspended
in fluorescence-activated cell sorting buffer (FACS; PBS containing 0.1%
BSA and 0.1% sodium azide) at 5 x 10^/100 μL. Cells were stained with
a panel of antibodies from AbD Serotec (Bio-Rad Laboratories, Inc.,
Hercules, CA) and available isotype controls. All antibodies were
primary, non-conjugated with the exception of MHC class II which
were directly conjugated to fluorescein isothiocyanate (FITC) and
Alexa Fluor 647 (BD Biosciences), respectively. Secondary antibodies
of either phycoerythrin (PE; 5 μL) or fluorescein isothiocyanate (FITC;
1 μL) were purchased from AbD Serotec and were used at various dilutions
(1:10, 1:50, and 1:100) [27]. Samples were run on a BD
Accuri flow cytometer and analyzed with BD Accuri Flow analysis software
(BD Biosciences).

2.5. Purification of IgG1+/CD19− and IgG1+/CD19+ B cells

Splenic B cell subsets were obtained from four WT and Aldh1a1−/−
female and one male mice. They were purified by automated magnetic
cell separation (autoMACS, Miltenyi Biotec). The cell suspensions were
incubated with microbeads conjugated with anti-mouse CD19. The
CD19− and CD19+ populations were separated by autoMACS. The
CD19− and CD19+ fractions were further incubated with a biotinylated
rat anti-mouse γ1 (clone G1-7.3, BD Biosciences) and streptavidin-
conjugated microbeads. Populations of IgG1+/CD19− and IgG1+/
CD19+ were separated by autoMACS. Throughout, we denoted these
populations as CD19− and CD19+. The purity (>98%) of the cell population
was confirmed by FACS.

2.6.  RNA  isolation  and  quantitative  real  time  PCR  (qRT-PCR) 

Total  RNA  was  prepared  using  the  RNeasy  kit  (Qiagen,  Valencia,  CA). 
qRT-PCR  was  performed  with  predesigned  assays  (Applied  Biosystems, 
Foster City, CA) using a 7900HT Fast Real-Time PCR System, TaqMan detection
system, and validated primers (Applied Biosystems, Foster City,
CA)  in  triplicate  as  described  [16].  The  mRNA  expression  was  calculated 
based  on  Tata-box  binding  protein  (TBP)  expression  for  normalization 
using  the  comparative  Ct  method. 
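
The comparative Ct calculation mentioned above can be sketched in a few lines. This is an illustrative sketch, not the authors' analysis script: the Ct values are made up, the function and argument names are hypothetical, and the formula assumes the usual simplification that both assays amplify with about 100% efficiency, so that relative expression equals 2 raised to the power of minus ΔΔCt, where ΔCt is the target Ct minus the TBP Ct and ΔΔCt is the sample ΔCt minus the calibrator ΔCt.

```python
from statistics import fmean

def fold_change(target_ct, reference_ct, calibrator_target_ct, calibrator_reference_ct):
    """Relative expression by the comparative Ct method.

    Each argument is a list of replicate Ct values (e.g. triplicates).
    Assumes both assays amplify with ~100% efficiency.
    """
    # Normalize the gene of interest to the reference gene (TBP in the text).
    delta_ct_sample = fmean(target_ct) - fmean(reference_ct)
    delta_ct_calibrator = fmean(calibrator_target_ct) - fmean(calibrator_reference_ct)
    # One cycle of difference corresponds to a two-fold difference in template.
    return 2.0 ** -(delta_ct_sample - delta_ct_calibrator)

# Hypothetical triplicate Ct values: sample of interest versus a WT calibrator.
sample_vs_wt = fold_change(
    target_ct=[24.1, 24.0, 23.9],
    reference_ct=[26.0, 26.1, 25.9],
    calibrator_target_ct=[26.1, 26.0, 25.9],
    calibrator_reference_ct=[26.0, 26.1, 25.9],
)
print(f"fold change relative to calibrator: {sample_vs_wt:.2f}")
```

Here the sample's target crosses threshold two cycles earlier than the calibrator's after normalization to the reference gene, which corresponds to roughly a four-fold higher expression.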

2.7.  NanoString  gene  expression  profiling 

The digital multiplexed NanoString nCounter mouse inflammation
expression assay (NanoString Technologies) was performed with
100 ng of total RNA according to the manufacturer’s instructions. RNA
was isolated from the CD19− and CD19+ fractions obtained from 3 female
WT and Aldh1a1−/− mice. NanoString's nCounter technology is based
on direct detection of target molecules using color-coded molecular
barcodes, providing a digital quantification of the number of target molecules
[28]. Total mRNA (5 μL) was hybridized overnight with nCounter
Reporter (20 μL) probes in hybridization buffer and nCounter Capture
probes (5 μL). The hybridizations were incubated at 65 °C for 16-20 h
in excess of probes to ensure that each target finds a probe pair. Excess
probes were removed using two-step magnetic bead based purification
on the nCounter Prep Station. The hybridization mixture containing target/probe
complexes was allowed to bind to magnetic beads containing
complementary sequences on the Capture Probe and washed, followed
by a sequential binding to sequences on the Reporter Probe. Biotinylated
capture probe-bound samples were immobilized and recovered on a
streptavidin-coated cartridge. The abundance of specific target molecules
was then quantified using the nCounter Digital Analyzer to count
the individual fluorescent barcodes and assess target molecules present
in each sample with a CCD camera. For each assay, a high-density scan
was performed at the highest standard data resolution, 600 fields of view
(FOV), which determines the dynamic range and level of sensitivity
of the system. Images were processed internally into a digital
format and were normalized using the NanoString nSolver software
analysis tool. Counts were normalized for all target RNAs in all samples
based on the positive control RNA to account for differences in hybridization
efficiency and post-hybridization processing, including purification
and immobilization of complexes. Subsequently, a normalization
of mRNA content was performed using six internal reference housekeeping
genes that were included within the mouse inflammatory
panel: Cltc, Gapdh, Gusb, Hprt1, Pgk1, and Tubb. The average was
normalized by background counts for each sample obtained from the
average of the eight negative control counts. Counts were corrected by
subtracting the mean plus 2 standard deviations of the negative
controls from the counts obtained for each target RNA.
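
The three normalization steps described above (positive-control scaling, housekeeping-gene scaling, and background subtraction at the mean plus 2 standard deviations of the negative controls) can be sketched as follows. This is a minimal illustration under stated assumptions and does not reproduce the nSolver software: the use of a geometric mean to summarize control probes, the order in which the steps are applied, the flooring of negative values at zero, and all counts, gene names, and function names are assumptions made for the example.

```python
from statistics import geometric_mean, mean, stdev

def scale_to_average(reference_counts):
    """One factor per sample: cross-sample average reference signal / that sample's signal."""
    signal = [geometric_mean(counts) for counts in reference_counts]
    target = mean(signal)
    return [target / s for s in signal]

def normalize(samples):
    """Each sample is a dict with 'positive', 'negative', 'housekeeping' lists and a 'targets' dict."""
    # Step 1: positive controls correct for hybridization and processing efficiency.
    pos = scale_to_average([s["positive"] for s in samples])
    # Step 2: housekeeping genes (already positive-control scaled) correct for RNA content.
    hk = scale_to_average(
        [[c * f for c in s["housekeeping"]] for s, f in zip(samples, pos)]
    )
    normalized = []
    for s, pf, hf in zip(samples, pos, hk):
        # Step 3: background is the mean + 2 SD of the sample's negative controls.
        negatives = [c * pf for c in s["negative"]]
        background = mean(negatives) + 2 * stdev(negatives)
        normalized.append(
            {gene: max(count * pf - background, 0.0) * hf
             for gene, count in s["targets"].items()}
        )
    return normalized

# Two hypothetical samples; the second has exactly half the signal of the first,
# as if hybridization had been half as efficient.
sample_a = {
    "positive": [1000, 500, 250],
    "negative": [8, 10, 12, 10, 9, 11, 10, 10],
    "housekeeping": [900, 800, 700, 600, 500, 400],
    "targets": {"Il6": 400, "Tnf": 20},
}
sample_b = {
    "positive": [c / 2 for c in sample_a["positive"]],
    "negative": [c / 2 for c in sample_a["negative"]],
    "housekeeping": [c / 2 for c in sample_a["housekeeping"]],
    "targets": {gene: c / 2 for gene, c in sample_a["targets"].items()},
}
result = normalize([sample_a, sample_b])
print(result)
```

Because the second sample differs from the first only by a technical scaling factor, the two normalized profiles should agree, which is the property the positive-control step is meant to provide.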


2.8.  Immunohistochemistry 

Spleens and kidneys were embedded in paraffin. Immunohistochemical
analysis of spleens from WT and RARE-lacZ mice was performed
with rabbit polyclonal β-galactosidase antibody (1:1000 dilution). Images
were obtained using Olympus IVI081 1X50 and Pixera Viewfinder
3.0 software.

Fig. 1. Activation of retinoic acid receptor response element (RARE) accompanies splenic red pulp development, which depends on Aldh1a1. (A) RARE activation was studied in WT and
RARE-LacZ mice (n = 3 from each group) which were injected every 48 h, up to a total of 3 injections, with 1 ml PBS (vehicle) without or with RA (500 nM). RA was added into PBS from
500 μM RA stock solution in ethanol immediately before injection. Vehicle PBS solution contained 1 μL ethanol. All RA solutions were protected from light and stored under argon atmosphere.
Total injected RA amount was 1.5 nmol per mouse (0.15 μg/dose). Immediately after the third injection, mice were harvested and their spleens were embedded in paraffin.
Immunohistochemistry was performed with anti-β-galactosidase antibody. Heterogeneous brown β-galactosidase-positive areas were found in red pulp (RP) compared to follicular
zone (F) (10x magnification). The RARE responses in adipose and hepatic tissues were described in [25]. (B) Weight of spleens (left panel, Study 2. Aldh1a1−/− (n = 10) and WT
(n = 9)). Inset shows the representative whole spleen images of WT and Aldh1a1−/− mice. (C) Representative hematoxylin & eosin staining of paraffin embedded spleen section
from WT and Aldh1a1−/− (KO) mice from the same study (n = 3 per group). (D) FACS analysis of splenocyte suspension isolated from whole spleens of Aldh1a1−/− (n = 4) and WT
(n = 4) mice. P, significance levels, Mann-Whitney U test. (E) Expression of germinal center markers in the total spleen lysates isolated from WT (white bars) and Aldh1a1−/− (black
bars) mice was analyzed by TaqMan assays (WT: n = 3; Aldh1a1−/− n = 5). Data were normalized by TBP. Significant difference was determined using Mann-Whitney U test.


R. Yasmeen et al. / Biochimica et Biophysica Acta 1833 (2013) 3218-3227




2.9.  Transfections 

U266B1 (U266) and RPMI 8226 were purchased from American Type
Culture Collection (Manassas, VA); other B cell lines were provided by
Dr. D.M. Benson, Jr. All human B cells were maintained in 15% fetal
bovine serum/RPMI 1640 medium as previously described [29].
Human full length Aldh1a1 cDNA expression vector was purchased
from OriGene (Rockville, MD). U266 cells (6 x 10® per tube) were
transfected with human full length Aldh1a1 (pCMV6-XL5, OriGene) or
empty vector, using the Amaxa Cell Line Nucleofector Kit C (Lonza,
NJ). Transient transfections were performed in NIH-3T3 fibroblasts lacking
Pparg expression using Fugene (Roche, South San Francisco, CA) and
the following vectors: HoxA10 luciferase reporter vector (Switchgear
Genomics, Menlo Park, CA), control Renilla reporter vector, and murine
full length Rara, Pparg, and Rxra constructs according to a previous protocol
[16,20].

Fig. 2. Increased proportion of CD19+ B cells contributes to splenomegaly in Aldh1a1−/− mice. (A) CD19− and CD19+ B cells were isolated from whole spleens of WT and Aldh1a1−/− mice
using automated magnetic cell separation and double selection with IgG1 and CD19 antibodies (n = 5 per group). Cell populations were quantified. (B-E) Expression of CD19 (B) and
plasma markers (CD138) (C) as well as differentiation (CD79a, D) and naive (B220, E) B cell markers were quantified in the isolated CD19− and CD19+ B cells using TaqMan assays
(n = 3 from each group). Data were normalized by TBP. Asterisk, significant difference in expression between CD19− and CD19+ B cells of the same genetic background. Mann-Whitney
U test. (F-I) Representative immunostaining characteristics (from n = 5 per group) of total splenocytes (F) and isolated CD19− and CD19+ B cells (G) using FACS analysis. Total
splenocytes and isolated CD19− and CD19+ B cells were simultaneously analyzed with CD19, CD138 (H), and B220 (I) antibodies conjugated with different secondary fluorescent antibodies.
For all gates, the percent and total count of all cells staining positive for both antibodies were determined.


2.10.  Statistical  analysis 

Oncomine cancer transcriptome database (https://www.oncomine.org)
was used as a publicly available platform for data-mining in
mRNA-expression studies [30]. The search terms ‘Aldh1a1 and multiple
myeloma’ were used to identify relevant studies with sufficient sample
numbers to compare expression of Aldh1a1 in multiple myeloma and
plasma cells in the same dataset. One such dataset was identified [31].
All other data group comparisons were performed using Mann-Whitney
U test unless otherwise indicated, and correlations were examined by
Pearson's test.
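
As an illustration of the two tests named above, the following minimal sketch computes a Mann-Whitney U statistic with an exact two-sided P value and a Pearson correlation coefficient in plain Python. It is not the authors' analysis code: the function names and all numeric values are hypothetical, and the exact P value is obtained by enumerating every regrouping of the pooled data, which is only practical for the small group sizes used in this kind of study.

```python
from itertools import combinations
from math import sqrt


def mann_whitney_u(x, y):
    """Return (U, exact two-sided P) for two small independent samples.

    U counts pairs (xi, yj) with xi > yj; ties count one half.
    The P value enumerates all splits of the pooled values into groups
    of the original sizes (assumption: samples are small enough for this).
    """
    def u_stat(a, b):
        return sum(1.0 if ai > bj else 0.5 if ai == bj else 0.0
                   for ai in a for bj in b)

    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2.0
    observed = abs(u_stat(x, y) - mean_u)
    hits = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(u_stat(a, b) - mean_u) >= observed - 1e-12:
            hits += 1
    return u_stat(x, y), hits / total


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical normalized expression values, n = 5 per group.
wt = [1.0, 1.2, 0.9, 1.1, 1.05]
ko = [2.4, 2.9, 2.2, 3.1, 2.6]
u, p = mann_whitney_u(ko, wt)   # complete separation: U = 25, P = 2/252

# Hypothetical paired expression values for a correlation check.
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])   # r = 1.0
```

A rank-based test of this kind makes no normality assumption, which suits small groups. In practice a statistics package would normally be used in place of the enumeration shown here.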


3.  Results 

3.1.  Vitamin  A  metabolism  regulates  immature  B  cell  populations  in 
the  spleen 

The  topography  of  RAR  activation  in  mouse  spleen  was  assessed  in 
RARE-lacZ mice treated with and without RA (Fig. 1A). RARE activation
in  the  spleen  was  heterogeneous  and  was  predominant  in  the  red  pulp 
compared  to  lymphoid  follicles  in  both  non-treated  and  RA-treated 
samples. 

RA is produced by an ALDH1 family of enzymes: ALDH1a1, ALDH1a2,
and ALDH1a3. In Aldh1a1−/− mice, the spleen is enlarged (256%, Fig. 1B)
compared to spleens in WT mice. Spleen architecture is altered in
Aldh1a1−/− vs. WT mice (Fig. 1C) due to the increased proportion of
CD19+ and B220+ B cell populations (Fig. 1D). Germinal center markers
MTA3 and BCL6 [32] were expressed at similar levels in Aldh1a1−/− and
WT spleens, indicating that they were not impaired by Aldh1a1 deficiency
(Fig. 1E). The expression of Cd1d1 was 61% lower in Aldh1a1−/− than
in WT splenocytes. The association of splenomegaly with the increase in
CD19+ and B220+ B cell populations suggests that differentiation could
be impaired in Aldh1a1−/− mice.

Differentiated CD19+ B cells are one of the major leukocyte populations
in red pulp. Among them, the IgG1+ B cell population was sensitive
to RA in pharmacological studies [3,4]. Therefore, we used magnetic cell
separation technology to separate IgG1+/CD19− and IgG1+/CD19+
splenic B cell populations (Fig. 2) to test for effects of Aldh1a1 deficiency
on transcriptional regulation of critical immune pathways in B cells. We
termed IgG1+/CD19− as CD19− and IgG1+/CD19+ as CD19+ B cells
throughout the publication. The splenomegaly seen in Aldh1a1−/−
mice (Fig. 1B) was associated with an increased number of CD19+ B
cells (171%) compared to WT (Fig. 2A). The purity and characteristics
of the CD19+ B cell population were examined using CD19 expression
(Fig. 2B) and FACS analysis (Fig. 2F-G). Although both cell populations
expressed similar low levels of plasma cell marker CD138 (Fig. 2C),
both CD19+ and CD19− B cell populations were CD138-positive in
FACS analysis (Fig. 2H). CD19+ B cells also expressed a mature B cell
marker CD79a (Fig. 2D). In contrast, expression of the pro-, mature, and
activated B cell marker B220 was lower in CD19+ than in CD19− B
cells (Fig. 2E). Both groups were B220-positive in FACS analysis
(Fig. 2I). Notably, the expression of all major studied B cell markers
was similar in WT and Aldh1a1−/− mice. However, Aldh1a1 deficiency
was associated with an increase in the CD19+ B cell population and
splenomegaly.


3.2. Dissimilar expression of Aldh1 in CD19− and CD19+ B cell populations

To investigate whether CD19− and CD19+ B cells metabolize vitamin
A, we examined the expression of enzymes involved in synthesis of Rald
(Fig. 3A) and RA (Fig. 3B). The expression of major Rald-generating enzymes
(Rdh10, Adh4) was similar between CD19− and CD19+ B cells
(Fig. 3A, left panel). Aldh1a1 deficiency moderately decreased Rdh10
levels (−24%, Fig. 3A, right panel). In contrast, expression of the
RA-generating Aldh1a1 and Aldh1a2 enzymes was markedly reduced to
6.4% and 18%, respectively, in CD19+ compared to CD19− B cells
(Fig. 3B, left panel). Aldh1a1 was the predominantly expressed member
of the ALDH1 family of enzymes in both CD19− and CD19+ B cells
(Fig. 3B, left panel). Aldh1a1 deficiency suppressed the expression of
Aldh1a2 in CD19− and CD19+ B cells (Fig. 3B, right panel). Thus,
CD19+ B cells in Aldh1a1−/− mice had reduced levels of all
RA-producing enzymes. The change in Aldh1a1 expression also influenced
expression of Rara, the primary transcription factor regulated by RA
[18]. Rara was reduced to 44% in CD19+ vs. CD19− B cells in WT mice
(Fig. 3C, left panel). In Aldh1a1−/− CD19− B cells, Rara expression was
decreased to 60% compared to WT CD19− B cells and became similar
to that seen in CD19+ B cells (Fig. 3C, right panel).
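
The percentages in this section are relative expression levels from TaqMan assays normalized to TBP. The text does not state the quantification formula, so the sketch below assumes the common comparative Ct approach (relative expression = 2^-(Ct_target - Ct_reference), with 100% amplification efficiency). The function names and Ct values are hypothetical and serve only to show how a statement such as "reduced to 25% of control" could arise from raw Ct values.

```python
def relative_expression(ct_target, ct_reference):
    """Target expression relative to a reference gene (e.g. TBP).

    Assumption: comparative Ct method with perfect doubling per cycle,
    so each extra cycle of delay halves the estimated amount.
    """
    delta_ct = ct_target - ct_reference
    return 2.0 ** (-delta_ct)


def percent_of_control(sample_expression, control_expression):
    """Express a normalized sample level as a percentage of a control level."""
    return 100.0 * sample_expression / control_expression


# Hypothetical Ct values: the target gene amplifies two cycles later in
# the sample than in the control, with the same reference-gene Ct.
control = relative_expression(ct_target=24.0, ct_reference=22.0)  # 2^-2
sample = relative_expression(ct_target=26.0, ct_reference=22.0)   # 2^-4
percent = percent_of_control(sample, control)                     # 25.0
```

Normalizing to a reference gene first removes differences in input RNA amount, so the final percentage reflects the target gene alone.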

Fig. 3. Diminished expression of RA-generating Aldh1 enzymes and Rara in B cells from
Aldh1a1−/− vs. WT mice. (A-C) CD19− and CD19+ B cells (same as in Fig. 2A, n = 5 per
group) isolated from WT (left panels) and Aldh1a1−/− (right panels) mice were examined
for the expression of (A) major retinaldehyde (Rald)-generating enzymes Rdh10 and Adh4,
(B) RA-generating enzymes Aldh1a1, Aldh1a2, and Aldh1a3, and (C) Rara. Gene expression
was analyzed by TaqMan assays and normalized with TBP. P, significant difference in
expression between CD19− and CD19+ B cells. Mann-Whitney U test (throughout this
figure).

3.3. Aldh1a1 deficiency results in oncogene expression in CD19+ B cells

To identify the mechanisms altering the number and properties of
CD19+ B cells in Aldh1a1−/− mice, we analyzed the classic markers of
B cell apoptosis (caspase 3), proliferation (Mki67) (Fig. 4A), and
differentiation (Ebf1 and Pax5) in all groups (Fig. 4B). The expression of
these genes was not altered between Aldh1a1−/− vs. WT genotype in
the isolated CD19− and CD19+ B cells. Ebf1 and Pax5 expressions were
higher in differentiated CD19+ than in CD19− splenocytes; however,
these levels were not influenced by WT and Aldh1a1−/− genotypes.

To identify genes increasing B cell population in Aldh1a1−/−
mice, we quantified the expression of 250 inflammatory genes using
NanoString Technologies' nCounter System. Gene expression was
normalized using six housekeeping genes. The gene cluster analysis
revealed that Aldh1a1 influenced expression of proto-oncogenes (c-Fos,
c-Jun, and Mafic kinase) and an opsonin (C1qb) (Fig. 4C). Aldh1a1−/−
CD19− B cells expressed 267% higher levels of c-Fos than WT cells
(Fig. 4D). These changes were in agreement with decreased expression
of Rara, a known suppressor of c-Fos, in CD19− B cells (Fig. 3C) [33].
Aldh1a1 deficiency affected CD19+ B cells more than CD19− B cells.
Specifically, c-Fos and c-Jun expression levels were 711% and 294%
higher than those in WT cells (Fig. 4D). This finding was paradoxical
because CD19− B cells expressed 15.5-times higher Aldh1a1 and 5.5-times
higher Aldh1a2 levels than those seen in CD19+ B cells from WT mice
(Fig. 3B). We hypothesized that another Aldh1a1-sensitive suppressor
of c-Fos/c-Jun is active in CD19+ B cells.
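
The results above report the same comparison in two ways: "reduced to 6.4% and 18%" in Section 3.2 and "15.5-times" and "5.5-times higher" here. The short sketch below shows the arithmetic that links the two forms. The helper names are hypothetical, and it assumes that "reduced to X%" means X% of the reference level and that "X% higher" means an increase of X% over the reference level.

```python
def reduced_to_percent_as_fold(percent_of_reference):
    """If A is X% of B, then B is (100 / X)-fold of A."""
    return 100.0 / percent_of_reference


def percent_higher_as_fold(percent_increase):
    """An increase of X% over a reference is a (1 + X/100)-fold level."""
    return 1.0 + percent_increase / 100.0


# Aldh1a1 and Aldh1a2 in CD19+ relative to CD19- B cells (Section 3.2).
aldh1a1_fold = reduced_to_percent_as_fold(6.4)   # about 15.6-fold
aldh1a2_fold = reduced_to_percent_as_fold(18.0)  # about 5.6-fold

# c-Fos in Aldh1a1-/- relative to WT cells, under the stated assumption.
cfos_fold = percent_higher_as_fold(267.0)        # about 3.7-fold
```

Under these assumptions the reciprocals of 6.4% and 18% come to roughly 15.6 and 5.6, close to the 15.5-times and 5.5-times differences quoted in the text.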

Fig. 4. Increased expression of proto-oncogenic genes in isolated splenic CD19− and CD19+ Aldh1a1−/− vs. WT B cells. CD19− and CD19+ B cells were isolated from spleens of the same WT
(white bars) and Aldh1a1−/− (black bars) group of mice (Fig. 2, n = 5 per group). (A & B) Expression of apoptosis (caspase 3) and proliferation markers (Mki67) (A) as well as B cell
differentiation markers (Ebf1 and Pax5) were semi-quantified using TaqMan assays. Data (n = 3 per group) were normalized by TBP. Asterisk, significant difference in expression between
CD19− and CD19+ B cells of the same genetic background. Mann-Whitney U test. (C) Selected expression heat maps (red and green colors represent high and low expression levels,
respectively) obtained using NanoString Technologies' nCounter mouse inflammation panel. The nCounter GX Mouse Inflammation Kit (NanoString Technologies) consists of 184
inflammation-related genes and six internal reference genes (www.nanostring.com). Red boxes show statistically significant (n = 3 per group, P < 0.05, Mann-Whitney U test) clusters
of genes. (D) Expression levels of c-Fos and c-Jun, using NanoString Technologies' nCounter mouse inflammation panel. Insets show the extracted expression heat maps for c-Fos and c-Jun.
P, significant difference between WT and Aldh1a1−/− B cells, Mann-Whitney U test (n = 3 per group).

3.4. Aldh1a1 limits oncogene expression in CD19+ B cells by a sequential
induction of Zfp423 and Pparg

ALDH1 enzymes can induce the transcription factor Zfp423 which,
in turn, controls the expression of the anti-proliferative and
anti-inflammatory transcription factor Pparg in adipocytes [20]. In splenic B
cells, expression of Aldh1a1 positively correlated with Zfp423, specifically
in CD19+ B cells (P < 0.001) (Fig. 5A). Zfp423 was markedly
up-regulated in CD19+ (625%) vs. CD19− WT B cells. However, this
increase was abolished in Aldh1a1−/− CD19+ B cells (Fig. 5B), suggesting
a regulatory role of Aldh1a1. The Zfp423 expression levels were also
correlated with Pparg levels in CD19+ B cells (Fig. 5C). In agreement
with Zfp423's role in the induction of Pparg [20], only CD19+ B cells
expressed less Pparg (−75%) in Aldh1a1−/− than WT splenocytes
(Fig. 5D). A PPAR response element (PPRE) has been identified in the
promoter of transcription factor Hoxa10 [34], a key inducer of human
lymphomyelopoiesis [35]. To examine a possible role for Pparg in the
regulation of the Hoxa10 promoter, we performed transfection studies
in NIH-3T3 fibroblasts, a cell line lacking endogenous Pparg expression.
Forced expression of Pparg inhibited activation of the Hoxa10 promoter in a
gene dose-dependent manner (Fig. 5E). Pparg expression also markedly
suppressed Hoxa10 promoter activation in a ligand (rosiglitazone)-dependent
manner (Fig. 5F). In contrast, both Rxra and Rara only moderately
activated the Hoxa10 promoter reporter in the presence or absence
of RA ligand (Fig. 5G). Consistent with a regulatory role of Pparg,
CD19+ B cells expressed 400% higher levels of Hoxa10 in Aldh1a1−/−
than in WT mice (Fig. 5H). Thus, elevated Hoxa10 expression in
Aldh1a1−/− CD19+ B cells could be a direct effect of deficient Pparg
expression. Other transcriptional mechanisms may also regulate Hoxa10
in CD19+ B cells. Since elevated expression of proto-oncogenes AP1
(c-Fos/c-Jun) and Hoxa10 is involved in the development of multiple
myeloma (MM), we examined the Aldh1a1 expression in B-cell related
cancers.

3.5. RA and/or Aldh1a1 rescues Rara and Zfp423 expressions in myeloma
B cell lines

A search of publicly available cancer gene expression data using
Oncomine analysis revealed that B-cell-related cancer cell lines express
low levels of Aldh1a1 compared to other cancer cell lines (Fig. 6A). In
our studies in peripheral blood mononuclear cells isolated from healthy
donors (Fig. 6B), Aldh1a1 was the predominantly expressed gene from the
ALDH1 family. This pattern of expression was markedly changed in MM
cell lines U266B1, RPMI8226, OPM2, L363, and MM1s (Fig. 6B). L363
MM cells expressed no Aldh1 genes. Aldh1a1 expression was lower
compared to the expression of Aldh1a2 and Aldh1a3 in these MM cells.
In agreement, the Oncomine analysis findings [31] showed reduced
Aldh1a1 expression in 74 human MM patients, compared to healthy
plasma cell controls (Fig. 6C). The RA stimulation of L363 MM cells
increased expression of Znf423, a human analog of murine Zfp423
(Fig. 6D). Aldh1a1 overexpression was even more effective
in up-regulating suppressors of proto-oncogenes. The expression of a
full-length human Aldh1a1 construct increased Aldh1a1 expression in
U266 MM cells without altering expression of Aldh1a2 and Aldh1a3
(Fig. 6E). This overexpression of Aldh1a1 resulted in an increased
expression of both Rara (30%) and Znf423 (500%) in U266 MM cells
(Fig. 6F).

4.  Discussion 

Humoral immune responses are mediated by mature follicular B
cells with the help of T cells in splenic germinal centers and, to a
minor extent (~10%), by marginal-zone B cells [2]. Cytokines/cytokine
receptors, Ig recognition, and antigen presented by APCs, dendritic
cells, and/or macrophages can initiate differentiation of B cells and
formation of germinal centers to achieve Ig production [11]. In these
processes, dietary vitamin A or RA can facilitate differentiation by classic
Pax5-dependent pathways in some splenic B cell populations [11,12].
Our study revealed a key role for RA-generating ALDH1 enzymes in B
cell biology. Aldh1a1 expression in immature CD19+ B cells and MM B
cells is critical for the establishment of a transcriptional profile (Fig. 7)
that prevents oncogene expression.


Fig. 5. Aldh1a1 influences expression of Zfp423 and Pparg genes in CD19+ B cells, suppressing
promoter activity of Hoxa10. (A & B) Correlation (A) between Aldh1a1 expression
(Aldh1a1 expression was shown in Fig. 3B) and expression levels of Zfp423 (B) in CD19−
(white bars or circles) and CD19+ (black bars or circles) B cells. Gene expression was
analyzed in triplicate by TaqMan assays (n = 3 per group). (C & D) Correlation (C) between
Zfp423 expression (B) and expression levels of Pparg (D) in CD19− (white bars or circles)
and CD19+ (black bars or circles) B cells (n = 6 in each correlation group). Gene expression
was analyzed by TaqMan assays in triplicate. Asterisk, significant difference in expression
between CD19− and CD19+ B cells of the same genetic background; P, significant
difference between WT and Aldh1a1−/− B cells, Mann-Whitney U test. (E-G) Promoter
analysis of Hoxa10 in NIH-3T3 fibroblasts lacking Pparg (n = 12). (E) NIH-3T3 fibroblasts
were transiently transfected with full-length Pparg overexpression or empty
(mock) vector (0). 48 h after transfection the expression of Hoxa10 was measured by
TaqMan assay. P, Pearson correlation. (F) Promoter analysis of Hoxa10 in NIH-3T3 fibroblasts
transiently transfected with mock or full-length Pparg overexpression vectors. 24 h
after transfection, cells were stimulated with different rosiglitazone concentrations for
15 h (n = 3 in each stimulated group). #, significantly different between cells expressing
mock and Pparg vector; asterisk, significant difference between Pparg expressing cells
stimulated with vehicle and rosiglitazone. (G) Cells were transiently transfected with
empty vector (Mock) or Rara and Rxra overexpression vectors (n = 9). 24 h after
transfection, cells were stimulated with different retinoic acid (RA, n = 3) concentrations for
27 h. #, significantly different between cells expressing empty vector and Rara and/or
Rxra; asterisk, significant difference between Rara/Rxra expressing cells stimulated with
vehicle (ethanol/DMSO, 50/50%) and RA, Mann-Whitney U test. (H) Expression levels of
Hoxa10 in CD19− (white bars or circles) and CD19+ (black bars or circles) B cells
measured by TaqMan assay (n = 3 for each group). Significance was determined as in (D).
Fig. 6. Impaired Aldh1a1 expression in human multiple myeloma B cell lines could be rescued by RA or Aldh1a1 overexpression that increased Rara and Znf423 levels. (A) Relative
expression of Aldh1a1 in hematological (n = 3) and other cancer cell lines (n = 28). Data were obtained from data mining in Oncomine database and based on the publication by Rhodes et al.
[30]. (B) Expression levels of Aldh1 genes were measured in human PBMC cells (n = 7 donors) and in myeloma cell lines (n = 5): RPMI, U266B1, OPM2, L363, and MM1s (n = 3 per
measurement). P, significant difference in expression between Aldh1a1, Aldh1a2, and Aldh1a3 enzymes within the same cell population. (C) Analysis of Aldh1a1 gene expression in
human multiple myeloma (n = 74), plasma cells (n = 37) and monoclonal gammopathy of undetermined significance (MGUS) (n = 5). Gene expression database analysis (see Materials
and methods) identified the Zhan et al. (2002) [31] data shown here (adapted from Oncomine; Rhodes et al. (2007) [30]). The plot boxes are lined at lower, median and upper quartile
score values; whiskers extend to 10th and 90th percentiles; dots mark minimum and maximum values. (D) RA treatment of L363 myeloma cells increases Znf423 expression. L363
cells were maintained in RPMI medium containing 1% of UV-treated FBS, which is depleted of retinoids. L363 cells were maintained in this medium 24 h prior to RA stimulation and during
treatment with RA. Znf423 expression was measured 48 h after RA treatment (n = 3). P, significant difference, Mann-Whitney U test. (E) Expression levels of Aldh1a1 (left panel) and
Aldh1a2 and Aldh1a3 (black and white bars in the right panel) in U266B1 cells transiently transfected with empty (−) or human full length Aldh1a1 overexpression plasmid (+)
(n = 3 independent experiments). Expression levels were measured in triplicate using TaqMan assays 24 h after transfection. P, significant difference, Mann-Whitney U test. (F) Expression
of Rara (left panel) and Znf423 (human analog of mouse Zfp423, right panel) in U266B1 (black bar) transfected with empty (−) and Aldh1a1 overexpression vector (+) (n = 3
independent experiments). P, significant difference, Mann-Whitney U test.


Aldh1a1 expression was consistently predominant over other
RA-generating enzymes from the ALDH1 family in healthy B cells in mice
and humans (Figs. 3B and 6B). Aldh1a1-dependent pathways in isolated
B cell populations were different from those altered by administration of
RA or manipulation of dietary vitamin A content. RA administration has
two major sites of action related to B cell functions. It improves antigen
presentation and IgA production at mucosal sites [15] and induces
proliferation and differentiation of IgG1+ splenocytes in germinal centers
[3,4,11]. In these scenarios, endogenous RA was produced by APCs and
stimulated final B cell differentiation in germinal centers [36]. We
showed that under physiological conditions, intense endogenous RAR
activity was associated with red pulp (Fig. 1A). In agreement, we found
the highest Aldh1a1 and Rara expression levels in IgG1+CD19− (CD19−)
B cell populations (Figs. 2, 3). CD19− is a potentially heterogeneous
population comprised of naive and mature B220+ B cell populations.
Both Rara and Aldh1 expressions were decreased in the IgG1+CD19+
(CD19+) population. This loss of endogenous RA production in CD19+
B cells could later allow them to receive paracrine signaling from
RA-producing APCs after they enter germinal centers. The physiological
expression of Aldh1 genes in CD19+ vs. CD19− B cells in WT mice
prevented an increase in expression of oncogenes c-Fos/c-Jun and Hoxa10
(Figs. 4D, 5H), due to the Aldh1a1-dependent induction of oncogene
suppressors (Zfp423 and Pparg) in these cells (Fig. 5B, D). The disruption
of Aldh1a1 regulation in Aldh1a1−/− mice markedly compromised B cell
oncogene profiles during differentiation (Figs. 4, 5), increased red pulp
proportions, and reduced expression of Cd1d1 that is involved in
interactions between T and B cells (Fig. 1D, right panel).

Fig. 7. Schematic diagram of the Aldh1a1-dependent pathways in CD19− and CD19+
IgG1+ B cell populations in the spleen. Aldh1a1 is expressed at higher levels in CD19−
vs. CD19+ B cells. Aldh1a1 also shapes the expression of transcription factors in the CD19+
population, where it is responsible for the induction of Zfp423 and Pparg. Expression of
Pparg suppressed proto-oncogenes c-Fos, c-Jun, and Hoxa10.

Classic CD19− and CD19+ B cell differentiation via Ebf1 and Pax5 [37]
appears not to be impaired in the absence of Aldh1a1 (Fig. 4B). The
breakthrough in the understanding of the mechanism increasing the CD19+
population came from the NanoString analysis (Fig. 4C). Aldh1a1-deficient
CD19+ B cells expressed proto-oncogenes c-Fos/c-Jun forming the
transcription factor AP1. Elevated c-Fos/c-Jun and also Hoxa10 expressions
are distinct features of B cell-dependent hematological neoplasms,
including lymphomas and multiple myeloma [38-40]. Hoxa10
overexpression in hematopoietic cells is sufficient to impair murine and
human lymphomyelopoiesis and leads to acute myeloid leukemia
[35,41]. Notably, all five human multiple myeloma (MM) cell lines
expressed 100-times less Aldh1a1 than normal PBMC cells (Fig. 6B).
Similar findings in B cell cancers and in plasma cells from MM patients
were available from other studies [31] that we identified through
database analysis (Fig. 6A, C). The major RA-generating enzyme in these
cancer cells was Aldh1a2, suggesting that the loss of Aldh1a1 contributed to
B cell neoplastic pathology. Previous studies showed that only combined
treatment of RA and rosiglitazone can induce U266 differentiation [14].
Our data provide a mechanism and rationale for the treatment of MM B
cells lacking Aldh1a1 with RAR and PPARγ agonists.

Increased c-Fos expression in Aldh1a1−/− CD19− B cells appears
to be consistent with the known competitive relation between AP1
and RARα activated by RA [33]. Indeed, Aldh1a1−/− CD19− B cells
had reduced expression of both Aldh1a1 and Rara (Fig. 3). Forced
expression of Aldh1a1 in U266 cells can readily increase Rara expression
(Fig. 6). An unexpected result of our study was the finding of
more profound transcriptional changes in CD19+ B cells, which
express markedly less Aldh1a1 than CD19− B cells in WT mice
(Fig.  3B).  WTCD19***  cells  expressed  higher  levels  of  proto¬ 
oncogenes  c-Fos,  c-Jun,  and  HoxalO  than  Aldhlal^^^  CD19^  B 
cells.  Increased  number  of  CD19'*'  cells  contributed  partially  to  the 
splenomegaly in Aldh1a1−/− mice. This phenomenon could be
based on Aldh1a1-mediated changes on the transcriptome. During
adipogenesis, Aldh1a1 induces expression of the transcription factor
Zfp423, which in turn induces Pparg [20,42]. Previous investigations
highlighted competition between Pparg and c-Fos/c-Jun without
engaging PPRE [43,44]. A PPRE was found in the
Hoxa10 promoter [34]. Our studies connected Aldh1a1 to the regulation
of all these transcription factors in B cells (Fig. 7). Aldh1a1 did
not support Zfp423 expression in CD19− B cells, probably due to
the specific transcriptional environment. It is possible that high
Aldh1a1 levels in CD19− B cells produced RA for paracrine signaling
that induced Zfp423 in CD19+ B cells. These regulatory mechanisms
remain to be investigated in the future. We found that Aldh1a1 is
required for the Zfp423 induction in the differentiating CD19+ population
(Fig. 5B). Zfp423 and Pparg levels were higher in CD19+ B cells in WT
compared to Aldh1a1−/− mice. This link was suggested by a significant
correlation in CD19+ B cells (Fig. 5C). The causative link
between Aldh1a1 and Zfp423 expressions was demonstrated in
human U266 MM cells. Aldh1a1 expression rescued Znf423 (human
analog of mouse Zfp423) in U266 MM cells (Fig. 6F). RA treatment
of MM cells can also rescue Znf423 expression (Fig. 6D), suggesting
that ALDH1a1 acts in part via autocrine RA generation. Since Pparg
is induced by Znf423, the short (24 h) transfection period was not
sufficient to also observe a significant increase in Pparg expression.
However, the relationship between Zfp423 and Pparg has been widely
documented [20,42]. Pparg induction by Aldh1a1 appears to be a
critical event, because Pparg was an effective suppressor of
Hoxa10, a key transcription factor perturbing myeloid and lymphoid
differentiation in mice and humans [34,35,41]. Pparg inhibited the
promoter of Hoxa10, while Rara and Rxra only modestly regulated
this transcription factor (Fig. 5F, G). Consequently, Hoxa10 was
up-regulated in Aldh1a1−/− CD19+ B cells expressing less Pparg.
Studies investigating the effects of vitamin A-deficient diets reported
splenomegaly and an increase in plasma IgG1 levels in mouse
models of autoimmune disorders [45], whereas the production of
specific IgG1 antibodies in the immunized mice was impaired [46].
Multiple mechanisms have been proposed to explain these phenomena,
including RA-dependent production of IFNγ from T cells [47]
and dendritic CD103+ cell subsets [48], but the role of B cells was
unclear. Our data highlight an autocrine and/or paracrine ALDH1
function in differentiating B cells that regulates two key transcriptional
oncogene suppressors, Rara and Pparg. This suggests a
gene-environment paradigm for early-stage deregulation of oncogene
profiles in IgG1+ B cell subsets through compromised vitamin A
metabolism.

5.  Conclusion 

Our findings showed the critical role of the retinoic acid-generating
ALDH1a1 enzyme in the sequential induction of oncogene suppressors
Rara in IgG1+/CD19− B cells and Zfp423/Pparg in IgG1+/CD19+ B cells
during B cell differentiation. In the absence of these suppressors, B
cells acquire oncogene AP1 and Hoxa10 expressions that lead to
IgG1+/CD19+ B cell expansion and splenomegaly. Reduced expression
of Aldh1a1 and oncogene suppressors Rara, Zfp423, and Pparg is a
characteristic property of malignant human multiple myeloma B cells.
Importantly, ectopic expression of Aldh1a1 or RA treatment effectively rescues Rara
and/or Zfp423 expression. The understanding of the role of ALDH1a1
in B cell differentiation can shed light on the early stages in the
development of malignant hematological disorders and may lead to the
development of novel therapeutics.
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Introduction 

The purpose of this proposal is to provide insight into gene-environment interactions. It leverages the simplified genetics
and detailed records of the military working dog population. There are several critical aspects to meeting the aims of this
proposal: 1) development of data-driven selection criteria, 2) biological sampling of representative dogs, and 3) generation
of mathematical methodologies capable of handling heterogeneous data and statistical tests in a consistent manner and
providing clear and understandable results that are biologically valid. Here we provide a breakdown of the previous year’s
work and document our progress towards achieving the specific aims we proposed. While the overall progress of this
project is summarized in the Annual Report by Dr. Carlos Alvarez (Lead PI from NCHRI), the following are the tasks in
which I (Huang from OSU) have been engaged.
Body


Task  1-  Regulatory  Approval: 

i) Cooperative Research And Development Agreements (CRADAs): Both the data and biological CRADAs
between Nationwide Children’s Hospital (NCHRI; Alvarez, Lead PI, home institution)/OSU (Huang and Couto,
Partnering PIs) and DoD/USA were executed by 2013.

ii) Animal use approval (Institutional Animal Care and Use Committee, IACUC): In 2012, the animal hospital at
Lackland AFB received the AAALAC accreditation that is mandatory for military IACUC approvals. In 2013, we
submitted final revisions to our IACUC protocol for the collection of biological samples, and Lackland veterinary
approval was granted. Final Lackland AFB oversight approval was also granted, and those documents were
submitted to DoD CDMRP grant administration. Currently, one final approval from ACURO is pending
(and expected, according to their original anticipated timeline, within ~1 month), at which time biological sample
collection can be initiated.


Task 2- Data Capture of Veterinary Records: With Ms. Michelle Perez, Veterinary Technician, embedded in the
military dog health service at Lackland AFB, we have been acquiring clinical and associated data from military dogs. This
was made possible by the execution of the CRADAs in 2013 (Task 1). The veterinary clinical cancer and medical records
expertise was provided by Dr. Couto. We have been using that data in two parallel tracks.

(i) In the first track, we have been using data forms to create advanced methods for capturing paper-based data and
converting it to electronic data (which is classified as raw or as manually confirmed to accurately represent the original),
using custom form versions of ABBYY software. That work was initiated in the technical sense before we had CRADAs
in place to use it on real DoD military dog health records. In 2013, Mr. Terry Camerlengo and his subsequent replacement
Mr. Jacob Aaronson (under supervision of Drs. Alvarez and Huang) worked with actual military dog health records
(scanned by Vet. Tech. Ms. Perez at Lackland AFB) to create those custom electronic versions of paper forms.
Specifically, they initiated the development of custom scanning and data capture from DoD military dog health record
form 1829 (which is generated for each health visit, providing longitudinal data) and from AFIP/JPC pathology reports
(which are generated for essentially all diagnostic cancer biopsies and sometimes for necropsy). That required significant
effort from ABBYY support and from Research IT, NCHRI, to implement. This effort is ongoing. If one or both final
customized forms are successful in the near future, we will be able to scan any future records and automatically isolate
each 1829 form and pathology report. Importantly, we would also be able to apply the forms to the many prioritized full
records already scanned and archived in our database in track (ii).

(ii) In the second track, which was initiated in 2012 and is ongoing through 2013, we have used different indicators to
prioritize individual dogs that are particularly important to our study and have begun scanning their complete records
(except for some associated clinical test data that could not be scanned, e.g., EKGs on thin perforated paper, which would
have risked destruction in our portable automatic-feed scanner). We are mainly focused on dogs that have had cancer, or
that are old enough that they most likely would have had it by now if they were at high risk. We thus acquired a list of all
Lackland AFB dog health records for which there are AFIP/JPC pathology reports. This was made possible by our
primary military dog program contact, LTC Cyle Richard. He provided us that list, which he received from AFIP/JPC; in
this way, we did not have to review thousands of records to identify those that contained pathology reports or cancer
diagnoses. This in turn allowed us to examine the pedigrees of dogs from the DoD military dog puppy program (DoD-bred
dogs, as opposed to purchased dogs) for selection of affected and unaffected littermates or half siblings. From this analysis
we identified a relatively small number of popular breeders that had many litters with different partners.


Task 3- Methodology Development:

Task 3 has advanced about as far as the data types we have acquired to date allow. Once final IACUC approval is granted
(expected within the month) and we begin to acquire military dog samples, we expect to be able to deploy the
methodologies we have developed. Specifically, we have validated the principal new methods using data from previously
acquired Greyhound osteosarcoma case and control samples, and from data published by the LUPA Consortium (Vaysse
A et al. 2011. Identification of genomic regions associated with phenotypic variation between dog breeds using selection
mapping. PLoS Genet. 7(10):e1002316. PubMed PMID: 22022279).

In the first year’s Annual Report, we included two manuscripts (Rybaczyk et al. and Rowell et al.) that used a new
methodology we developed under the present program. Both manuscripts were submitted for publication in leading
genetics journals, and we have been addressing reviewers’ criticisms and advice. Throughout 2013, we continued to refine
and validate those studies. Specifically, this work involves the invention of entirely novel techniques to conduct
genomewide association analysis or GWAS (Balding 2006) and multidimensional statistical analysis: Intersection Union
Testing or IUT (Berger 1982; Berger 1997) combined with Bootstrapping (both well established, but the approach has
never been used for these applications).

The original focus of these works was on development of the IUT. In the course of improving the methods to address
reviewer comments during this reporting year, we determined that the integration of Bootstrapping with IUT is a major
innovation and advantage (Fig. 1). The greatest concern about our manuscripts was that the IUT method does not generate
conventional measures of statistical significance (p-values), despite the fact that the method empirically ranked
IUT-“significant” hits correctly (according to detection of true positives in published datasets). [Notably, that is the major
focus of applications of IUT to biology and high throughput gene expression data. Some have proposed solving it using
Bayesian approaches, but after many years, no one has had success doing so.] By adding Bootstrapping upstream of IUT,
we are able to give another type of measure of robustness of results: a confidence (vs. significance) measure (Bootstrap
Confidence Value, BCV).

[Figure 1 panel labels: Bootstrap sampling with replacement (e.g., 1000 bootstrap re-samples); scree plotting shows
percent of sampled sets in which a marker is significant (separating true and false positives).]

Figure 1. Schematic of integrated Bootstrapping and Intersection Union Testing
(IUT) for genetic analysis. (A) The schematic on the left shows how a single dataset is
repeatedly subsampled (with replacement) and each subsample of cases and controls is
then put through the IUT compound hypothesis: i) for each subset of an IUT group, which
genetic markers have statistically significant frequency differences in cases and controls;
ii) keep only the markers that are significant in all subsets of an IUT (thus not requiring
multiple testing correction). Right hand notations compare our methods, which are
considered hypothesis tests, to analogous approaches in the field of Machine Learning,
which are not considered hypothesis tests but rather learning or predicting. (B) Illustration
of how IUT works in first panel: each marker (SNP) is tested for significance in each
subset of an IUT group (set #1, 2, 3); only those significant in all are kept. Second panel
illustrates how repeating the process on 1000 Bootstrap replicates (4 shown) can be used
to plot the proportion of times a marker is positive in the 1000 (scree plot, third panel).
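
The Bootstrap/IUT procedure described above can be illustrated with a minimal sketch. This is not the project's implementation; it is a simplified reading of the description under several assumptions: each individual is coded as carrier (1) or non-carrier (0) per marker, the per-subset test is a 2x2 chi-square test without continuity correction, the threshold is 0.05, and all function names and the synthetic data are hypothetical.

```python
import math
import random


def chi2_pvalue_2x2(a, b, c, d):
    """P-value of a 1-df chi-square test on the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 df


def significant_markers(cases, controls, alpha):
    """Markers whose carrier frequency differs between cases and controls."""
    hits = set()
    for m in range(len(cases[0])):
        a = sum(row[m] for row in cases)
        c = sum(row[m] for row in controls)
        p = chi2_pvalue_2x2(a, len(cases) - a, c, len(controls) - c)
        if p <= alpha:
            hits.add(m)
    return hits


def iut_hits(subsets, alpha=0.05):
    """Keep only markers that are significant in every subset of the IUT group."""
    per_subset = [significant_markers(ca, co, alpha) for ca, co in subsets]
    return set.intersection(*per_subset)


def bootstrap_confidence(subsets, n_boot=1000, alpha=0.05, seed=0):
    """Fraction of bootstrap replicates in which each marker passes the IUT."""
    rng = random.Random(seed)
    n_markers = len(subsets[0][0][0])
    counts = [0] * n_markers
    for _ in range(n_boot):
        # Resample cases and controls with replacement within each subset.
        resampled = [
            (rng.choices(ca, k=len(ca)), rng.choices(co, k=len(co)))
            for ca, co in subsets
        ]
        for m in iut_hits(resampled, alpha):
            counts[m] += 1
    return [count / n_boot for count in counts]


def make_group(n, n_carriers):
    """Synthetic individuals: marker 0 has n_carriers carriers, markers 1-2 have n/2."""
    return [[1 if i < n_carriers else 0, i % 2, (i // 2) % 2] for i in range(n)]


# Three case-control subsets of made-up data; only marker 0 differs by status.
subsets = [(make_group(40, 32), make_group(40, 8)) for _ in range(3)]

print(iut_hits(subsets))                          # markers passing a single IUT
print(bootstrap_confidence(subsets, n_boot=200))  # confidence value per marker
```

Sorting the returned confidence values in decreasing order gives the quantity plotted in a scree plot of the kind described for Figure 1.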

In this reporting year we found strong evidence that our method is very sensitive and specific, based on
analysis of the genetic contributions to the complex trait of dog size as a test (using the Vaysse et al. dataset cited above).
Specifically, we reanalyzed that published data and not only identified the two hits that those authors found to be
genomewide significant using conventional methods, but also found additional IUT-genomewide significant hits that they
missed (but which have been shown to be true positives in other canine genetics studies). We also generated new evidence
that Bootstrap/IUT methods i) have increased ability to detect weak signal (a critical need for complex genetics such as
cancer risk) and ii) do not require correction for population structure when the analysis is designed properly. We did this
by analyzing, as a test, the most complex dog trait reported by Vaysse et al. (ref. above): sociability (the response of a dog
when approached by another dog or a human). Experimental support for these claims was provided in figures within
the Q7 and Q8 Quarterly Reports.

In addition to the genetic analysis, we also face the challenge of enabling effective query of medical terms once
the database is completed. Given the large collection of biomedical term resources, such as ICD-9, ICD-10, and
SNOMED CT for clinical diagnosis, Gene Ontology for gene information, and various drug databases, different naming
systems can significantly affect search accuracy. In a collaboration with Dr. Yang Xiang (OSU Biomedical
Informatics), we tackle this issue by using the Unified Medical Language System (UMLS) developed by the National
Library of Medicine (NLM) of NIH. UMLS has a hierarchical structure for the medical vocabularies collected from more
than 100 databases, including the ones mentioned above. Each biomedical term is given a unique ID. In order to map the
user input words to the exact biomedical terms and IDs in any query, NLM provides a set of tools called Metathesaurus
Browser and MetaMap. However, these tools are quite strict about the input term and often fail if it contains
small errors or even a small discrepancy with the target term. We therefore developed a new algorithm called layered
dynamic programming mapping (LDPMap), which provides much higher accuracy in mapping the query terms to the
target medical terms. The algorithm was presented at the International Conference on Translational Bioinformatics in
Seoul, Korea, in October 2013, and the manuscript was accepted to the special issue of BMC Medical Genomics to be
published in 2014 (Ren 2014).
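
To illustrate why dynamic programming helps with inexact query terms, the sketch below uses plain edit-distance (Levenshtein) matching. It is not the LDPMap algorithm, whose layered design is not described in this report; it only shows the general idea of tolerating small spelling discrepancies. The function names, the mini-vocabulary, and the term IDs are hypothetical placeholders, not real UMLS identifiers.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def best_match(query, vocabulary):
    """Return (term_id, term, distance) for the closest vocabulary entry."""
    q = query.lower().strip()
    scored = [(edit_distance(q, term.lower()), term_id, term)
              for term_id, term in vocabulary.items()]
    distance, term_id, term = min(scored)
    return term_id, term, distance


# Hypothetical mini-vocabulary; the IDs are placeholders for illustration only.
VOCAB = {
    "T001": "osteosarcoma",
    "T002": "lymphoma",
    "T003": "hemangiosarcoma",
}

# A query with one missing letter still maps to the intended term.
print(best_match("osteosarcma", VOCAB))
```

An exact-match lookup would fail on the misspelled query, whereas the distance-based lookup returns the nearest term together with how far the query was from it, which a caller could use to reject poor matches.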

Task  6-  Adaptation  of  existing  resources,  data  storage  and  hosting: 

We have a secure virtual machine called Research DAPER, or resdaper, developed initially by Mr. Camerlengo and
continued by his replacement Mr. Aaronson (supervised by Drs. Alvarez and Huang). The machine exists on the secure
NCHRI (Alvarez) network behind a firewall. It can only be accessed by highly secure VPN using two-factor
authentication. We have an instance of Microsoft SQL Server installed on the machine. Microsoft SQL Server is an
industry-leading relational database product that we use to store all of our documents after they have been digitized. With
a relational database, information can be compared quickly because the data are arranged in columns. The relational
database model takes advantage of this uniformity to build completely new tables out of required information from
existing tables. In other words, it uses the relationship of similar data to increase the speed and versatility of the database.
The "relational" part of the name refers to mathematical relations. Each table contains a column or
columns that other tables can key on to gather information from that table. We have many fields on which we can filter
and sort to retrieve items. Ultimately, this will include all clinical and associated data, environmental data, and
genetic (genotype), epigenetic, and genomic/molecular (phenotype) data. The user interface is under construction. We will
have a web user interface that can be accessed by those with secure credentials. We have used Microsoft ASP.NET MVC
to build the user interface. Using the model-view-controller pattern gives us the benefit of separating the representation of
information from the user's interaction with it. The model consists of application data, business rules, logic, and functions.
A view can be any output representation of data, such as a chart or a diagram. The controller mediates input, converting it
to commands for the model or view.
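
The idea of tables keying on a shared column can be shown with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Microsoft SQL Server, and the table names, column names, and rows are invented for illustration; they are not the DAPER schema.

```python
import sqlite3

# In-memory database standing in for the real server (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dog (
    dog_id INTEGER PRIMARY KEY,
    breed  TEXT NOT NULL
);
CREATE TABLE pathology_report (
    report_id INTEGER PRIMARY KEY,
    dog_id    INTEGER NOT NULL REFERENCES dog(dog_id),
    diagnosis TEXT NOT NULL
);
""")

# Made-up rows: three dogs, two of which have a pathology report.
conn.executemany("INSERT INTO dog VALUES (?, ?)",
                 [(1, "Breed A"), (2, "Breed B"), (3, "Breed A")])
conn.executemany("INSERT INTO pathology_report VALUES (?, ?, ?)",
                 [(10, 1, "lymphoma"), (11, 3, "lipoma")])


def dogs_with_diagnosis(connection, diagnosis):
    """Join the two tables on dog_id and filter by diagnosis."""
    rows = connection.execute(
        "SELECT d.dog_id, d.breed "
        "FROM dog AS d "
        "JOIN pathology_report AS p ON p.dog_id = d.dog_id "
        "WHERE p.diagnosis = ? "
        "ORDER BY d.dog_id",
        (diagnosis,),
    )
    return rows.fetchall()


print(dogs_with_diagnosis(conn, "lymphoma"))
```

The join builds a new result table from the two existing ones through the shared dog_id key, which is the mechanism the paragraph above describes.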

In Task 2(i) we discussed the conversion of paper health records to digital versions using ABBYY software,
mainly the 1829 form and the AFIP/JPC pathology reports. That digitized data will be fully accessible and searchable
through the web interface mentioned above. In addition, the complete veterinary clinical records scanned in Task 2(ii) will
be directly linked in PDF format. This will allow analysis of digitized data with the option of follow-up detailed analysis
of full health records on the same database/tools ensemble “resdaper” (or confirmation/cross-validation of critical data).
We have thus installed the ABBYY FlexiCapture software and all of its components. These include the Processing
Server, which controls the operation of the Processing Stations; the Licensing Server, which stores and manages licenses;
and the Application Server, which controls the operation of the other components. We also installed the Application
Server components that allow operators to connect to the server and work using a web browser, and the component that
allows operators of web stations to register with the system and create requests for access rights to the web station. The
latter provides operators of web stations with a single entry point into the system.


Task  7:  Pathway  analysis  and  functional  characterization. 

Task 7a is complete. I (Alvarez) have been conducting extensive data mining and analysis, honing the skills
that will ultimately be applied to the study of cancer in military dogs. That includes work on osteosarcoma risk
candidate genes from Greyhounds (to be published in the Rowell et al. manuscript mentioned above) and LUPA candidate
genes for multiple canine traits (also discussed above). Most importantly, the Greyhound study implicated small genomic
regions with one or two genes each. This allowed use of human cancer data and analysis servers to predict which were
likely to be cancer genes and whether the human evidence suggested the cancer risk gene variant was likely to result in up
or down regulation. For example, the IntOGen server permits analysis of gene expression and genome alterations
associated with diverse cancer types. Other analysis servers, such as NextBio, Oncomine, KMplot and BioGPS,
provide different tools to mine the same gene expression data in very different ways. For example, NextBio performs
meta-analysis of any subset of studies, and KMplot generates Kaplan-Meier survival plots for a subset of cancer types for
which very large amounts of data are available. With this data in hand, it is possible to generate hypotheses and to conduct
cross-validation studies. For example, in the Greyhound osteosarcoma case, we can test those predictions by analyzing
genetic association candidates in a canine osteosarcoma tumor gene expression dataset that includes Greyhounds, Golden
Retrievers, Rottweilers and mixed breed dogs. Because there are orders of magnitude more human data than canine data,
it is critical to be able to make use of it.

Among the major aspects of genetic/genomic studies are contextualization according to biochemical or genetic
pathways, cross-dimensional/platform validation, and comparative genomics/cross-species validation. To that end, I have
conducted studies in these aspects of cancer genetics. Among those, I mined for genetic evidence that the enzyme
aldehyde dehydrogenase is involved in multiple myeloma (for which there is experimental evidence generated by a
collaborator studying this with their own funding). As a result, my analyses were added to a
manuscript that was recently accepted for publication. Although the following work was not based on our military dog
data, my contributions involve the same analyses that will be conducted with canine cancer candidate genes: Yasmeen R.,
Meyers J. M., Alvarez C. E., Thomas J. L., Bonnegarde-Bernard A., Alder H., Papenfuss T. L., Benson D. M. Jr, Boyaka
P. N., Ziouzenkova O. (2013) Aldehyde dehydrogenase-1a1 induces oncogene suppressor genes in B cell populations.
Biochim Biophys Acta 1833:3218-3227. (See Appendix II.) For example, I conducted the analysis shown in Figs. 6A and
6C. That critical information shows that the biology suggested by the Yasmeen et al. molecular/biochemical study can be
cross-validated by public datasets involving other types of evidence (here gene expression). Similarly, we expect that the
vast data available on human cancers will yield supporting evidence for canine cancer findings from the project that is the
subject of this report.


Task 8- Project management, Quality control and assurance, and Security:

The most important change in this reporting year is the execution of the CRADAs, which allowed us to acquire DoD
military dog data. We established a footprint at Lackland and implemented security protocols in accordance with our
agreements. We are conducting quality control evaluations of our data collection techniques to ensure that we are
collecting appropriate data. Once we have assured high quality data, we will begin automated import into the database. We
are also cross-validating medical and pathology records to ensure accurate diagnosis. We initiated a collaboration with Dr.
David Gutman at Emory University and hope to use his automated pathology database to facilitate confirmation of
sample classification.

As of June 1st, 2013, Task 8 duties attributed to Dr. Rybaczyk (who has moved on in his academic career, as an
NIH T32 Fellow, Michigan State U.) are being done by Dr. Alvarez. This transition has been smooth. A job listing was
posted for a replacement postdoctoral fellow. Dr. Alvarez interviewed a highly qualified postdoctoral fellow named Dr.
Sohan Fal (currently a postdoctoral fellow at Yale), but unfortunately Dr. Fal was forced to accept another position at Yale
due to the imminent expiration of his visa status. There is another candidate under consideration; the goal is to hire that
person prior to initiating the biological sample collection.

The replacement of Mr. Camerlengo, our computer programmer, was a success. His role has been taken up by Mr.
Jacob Aaronson, who may not be as experienced as Mr. Camerlengo but appears to have greater affinity for the
biomedical aspects of computational sciences. In particular, he is a research staff member in the Informatics Research &
Development Team of the OSU Department of Biomedical Informatics and has extensive experience in developing
databases and web tools/interfaces for biomedical applications in the Medical Center. Mr. Aaronson quickly completed his
NCHRI orientation, security clearance/ID badge, and vaccination requirements. Most importantly, he rapidly oriented
himself in the project and is performing high quality work.


Key  Research  Accomplishments 

• Execution of institutional agreements (CRADAs) between NCHRI (Alvarez)/OSU (Huang, Couto)

• Completion of all facets of IACUC between NCHRI and Lackland AFB through final Lackland AFB oversight
approval (currently waiting for final ACURO approval expected within ~1 month)

• Successful embedding of NCHRI (Alvarez) Veterinary Technician, Ms. Michelle Perez, within the military dog
health service at Lackland AFB

• Successful scanning of veterinary clinical records by Ms. Perez at Lackland AFB, transmission of encrypted data
to NCHRI, and uploading to DAPER database

• Continued development and validation of a scale free, high-power statistical methodology capable of resolving
signal from noise in high throughput genetic/genomic data (IUT/GIA) by incorporation of Bootstrapping

•  GIA  manuscripts  continue  to  be  refined  since  receiving  comments  from  peer  reviewers 

•  GIA  grant  application  to  NIH  is  being  refined  based  on  peer  reviewer  critiques 

•  Expansion  of  our  highly  flexible  data-infrastructure  that  is  robust  enough  to  handle  military  working  dog  records 
and  queries  of  said  records 

• Initiation of high-throughput software customization (ABBYY FlexiCapture) for analysis of 1829 longitudinal
veterinary records and AFIP/JPC pathological records

• Initiation of analysis of DoD military dog pathology reports to identify cancer-bearing dogs for cancer
classification and selection of cases and controls

• Initiation of DoD military dog “puppy program” pedigree analysis for identification of high and low cancer risk
lineages

•  Development  of  LDPMap  algorithm  for  mapping  query  terms  to  the  exact  biomedical  terms  in  UMLS. 

Reportable  Outcomes 

• Dr. Jennie Rowell, having received her PhD from OSU for her work at NCHRI (Alvarez), joined the lab of one of
two pre-eminent dog geneticists in the world, Elaine Ostrander, NIH, as a postdoctoral fellow. In the first week of
Nov. 2013, she has a job interview for a tenure track position at the College of Nursing, OSU.

• Expansion of DAPER database capabilities while maintaining strong security

• Mr. Terry Camerlengo moved from OSU to the Battelle Institute as a senior informatics developer. Mr. Jacob
Aaronson from the OSU Biomedical Informatics IR&D team has successfully taken over Mr. Camerlengo’s role.

• The manuscript for the LDPMap algorithm developed in the collaboration between Dr. Huang and Dr. Xiang has
been accepted to a special issue of BMC Medical Genomics.


Conclusion 


The project accelerated when the CRADAs were executed. In the first two years, we optimized the primary genotyping
and molecular methods, and the follow-on validation methods. We also expanded the capabilities of our highly flexible
DAPER database and software tools in the present reporting year. In the first year we invented an entirely novel approach
to conducting genomewide genetic association (GWA) analysis, namely genomewide IUT analysis (GIA), and in the
second year we further validated it. In this second reporting year, we integrated IUT and Bootstrapping as an additional
innovation with outstanding utility. Dr. Alvarez’s presentation of these methods and results to leaders in the fields of
genetics and canine genetics resulted in uniformly positive feedback (and multiple requests for collaboration).
In addition, Dr. Huang has developed the LDPMap algorithm for enabling accurate query of biomedical terms in the
database. We expect to publish the two revised manuscripts on GIA (one on methods, one on empirical cancer mapping)
shortly, but the latter may be delayed while we analyze new supporting data acquired from Dr. Lindblad-Toh. In addition,
we co-authored (Alvarez) a published study that was not based on the present military dog project, but which made use of
the same data mining and analysis methods that will be used in our study. The LDPMap algorithm paper has been
accepted by BMC Medical Genomics. Dr. Rowell, one of our investigators (originally as a predoctoral student), moved on
to conduct a postdoctoral fellowship with a pre-eminent dog geneticist at NIH and, after only a year there, is being
recruited for a tenure track faculty position at OSU. Dr. Rybaczyk, another of our investigators (originally a postdoctoral
fellow and promoted to research scientist), went on to be an NIH T32 Fellow at MSU, which is essentially a pre-faculty
position. Dr. Alvarez was promoted to Associate Professor with tenure by OSU and is now under consideration for
leadership training in the OSU College of Medicine.
Descriptive Title: Statistical techniques for optimized design and power in high-content genomics

Submission Title:

Opportunity ID: PAR-09-219

Opportunity Title: Exploratory Innovations in Biomedical Computational Science and Technology (R21)

Agency Name: National Institutes of Health

2.  SPECIFIC  AIMS.  This  application  is  in  response  to  PAR-09-219,  Exploratory  Innovations  in  Biomedical 
Computational Science and Technology; it addresses research, development and application of analytical and
statistical  tools  for  interpretation  of  large  biological  data  sets,  and  associated  software.  The  flood  of  biological 
data  has  highlighted  limitations  to  signal  detection.  Here  we  propose  that  combining  optimized  experimental 
design  and  novel  uses  of  statistical  methods  can  dramatically  increase  the  power  of  signal  detection.  These 
approaches  will  be  applicable  to  myriad  data  types  and  their  integration.  However,  this  proposal  will 
demonstrate  validity  using  a  highly  innovative  approach  to  complex  genetics.  We  will  conduct  a  Genome  Wide 
Association  (GWA)  study  using  high  density  genotyping  that  not  only  provides  binary  single  nucleotide 
polymorphism  (SNP)  allele  data,  but  also  total  SNP  signal  and  allele  ratios  (which  can  be  affected  by  DNA 
copy  number  variation,  CNV).  In  Preliminary  Studies  we  demonstrate  the  feasibility  of  using  allele  ratios  as 
continuous  variables  to  map  disease  loci.  This  is  the  first  such  GWA  study  of  comprehensive  CNV  information 
without  prior  classification  of  markers  as  CNV.  Our  hypothesis  is  that  implementation  of  our  algorithm  on 
multiple  (experimentally  standardized)  groups  dramatically  increases  the  power  to  detect  biological  signal. 

Experimental  design.  The  now  common  use  of  thousands  or  tens  of  thousands  of  subjects  in  genetic 
studies  can  be  attributed  to  genetic  heterogeneity/complexity  and  diverse  confounds  of  meta-analysis.  A  major 
limitation  is  the  extreme  multiple-testing  burden  in  GWA,  which  is  commonly  done  by  Chi-Square  testing  of  one 
million  markers.  In  Preliminary  Studies,  we  address  these  issues  by  1)  conducting  complex  disease  mapping 
studies  in  one  dog  breed,  which  has  100-fold  reduced  genetic  variation  compared  to  humans,  and  2)  using 
multiple,  but  experimentally  identical,  case-control  sets  or  batches.  In  this  way,  there  are  reduced  numbers  of 
disease-associated markers in a simpler background and we can apply an Intersection Union Test (IUT)
across experiments (in place of Bonferroni multiple-test correction). Computational statistics. The
proposed analytical approaches rest on the information theory concept that the more manipulations or
corrections are implemented, the more information is lost. We propose here that this
loss of information can be eliminated in diverse types of biological data by integrating two elements. In the first,
we use analysis of covariance (ANCOVA) to correct continuous variable data for latent known biological
confounders such as group membership. In the second, we make use of optimized study design (specifically,
using multiple case-control groups for a given experiment) to perform IUT. Others recently validated a similar
use of IUT independently. In Preliminary Studies, we demonstrate validation of the integrated ANCOVA and
IUT. We confirm that the use of IUT on multiple sets is a more effective solution to the three reversal
paradoxes (Yule-Simpson, Lord's, and suppression), which share the characteristic that the association
between two variables can be reversed, diminished, or enhanced when another variable is statistically
controlled for. Notably, we are the first to address these in the context of continuous genomic variables.
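
To make the reversal-paradox argument concrete, the following is a minimal sketch, not taken from our analysis pipeline, of how an association can reverse when batches are pooled. The two batches and their values are hypothetical and chosen only for illustration; `slope` and `center` are illustrative helper names.

```python
import numpy as np

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.cov(x, y, bias=True)[0, 1] / np.var(x))

def center(v):
    """Subtract the batch mean, i.e. remove the group-membership effect."""
    return v - v.mean()

# Two hypothetical batches: within each batch, y increases with x.
x_a, y_a = np.array([1.0, 2.0, 3.0]), np.array([10.0, 11.0, 12.0])
x_b, y_b = np.array([6.0, 7.0, 8.0]), np.array([3.0, 4.0, 5.0])

within_a = slope(x_a, y_a)   # positive within batch A
within_b = slope(x_b, y_b)   # positive within batch B

# Naive pooling ignores batch membership and reverses the sign.
pooled = slope(np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b]))

# Controlling for batch (here by centering within batch) restores the sign.
adjusted = slope(np.concatenate([center(x_a), center(x_b)]),
                 np.concatenate([center(y_a), center(y_b)]))

print(within_a, within_b, pooled, adjusted)
```

In this toy case the within-batch slopes are positive while the pooled slope is negative, which is the Yule-Simpson pattern the text refers to.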

Aim 1: Demonstrate on large datasets the ability of ANCOVA to correctly identify biologically relevant
phenomena that are linked to a disease trait. ANCOVA has been applied to correct for baseline variables in
various  fields,  such  as  psychology  and  epidemiology.  Despite  similarities  in  variable  types,  data  structure,  and 
confounds,  ANCOVA  has  never  been  applied  to  large  scale  genetic  datasets.  We  will  analyze  different  types  of 
genomic  datasets  (our  own  and  from  the  public  domain)  with  well-established  population  confounds  and  show 
that  ANCOVA  is  the  most  effective  way  of  removing  those. 

Aim 2: Application of IUT for genetic analysis, allowing for multiple corrections without manipulation of
individual datasets. We propose to demonstrate the ability of IUT to detect complex genetics in a disease
phenotype and how combining IUT with ANCOVA will allow the detection of genetic determinants. The
non-obvious advancement of this method is that it incorporates information theory by minimally altering the data
before  analyzing  it.  This  retains  the  maximum  amount  of  information  for  each  measure.  It  also  does  not 
assume  linear  relationships  with  latent  variables. 

Aim 3: We will validate our claim that ANCOVA and IUT are more powerful than traditional techniques.

We will replicate a published canine complex-genetics mapping study using fewer individuals to demonstrate
that our technique is able to detect the same loci in addition to other variants missed by traditional techniques.
We  will  also  conduct  a  novel  GWA  study  of  a  human  medically  relevant  complex  trait  in  a  second  dog  breed. 


3.  RESEARCH  STRATEGY 


(a)  SIGNIFICANCE 

We  will  develop  and  implement  analytical  and  statistical  tools  (and  software)  for  interpretation  of  large 
biological  data  sets.  The  explosion  of  biological  data  has  made  prominent  several  limitations  to  signal 
detection.^  We  demonstrate  in  Preliminary  Studies  that  combining  optimized  experimental  design  and  novel 
application  of  statistical  approaches  can  dramatically  improve  signal  detection.  These  methodologies  will  be 
applicable  to  analytical  challenges  of  myriad  data  types  and  their  integration  [^],  including  genomics  [^],  high 
throughput  (HT)  sequencing  ["^j,  population  biology  and  genetics  and  gene/organism/environment 
interactions  f].  The  improvements  described  here  address  the  basic  concept  of  information  theory  that  more 
manipulations of data equals more information loss. The areas addressed include 1) application of analysis
of covariance (ANCOVA; [®]) to correct continuous variable data for latent known biological confounders as well
as potentially avoiding the three reversal paradoxes (Yule-Simpson, Lord's, and suppression), which share the
characteristic that the association between two variables can be reversed, diminished, or enhanced when
another variable is statistically controlled for, and 2) multiple new applications of the Intersection Union
Test (IUT; [^^]), including GWA, as was independently developed by another investigator very recently [^^]. This
proposal  thus  offers  solutions  and  software  to  address  critical  barriers  to  genomic  analysis,  simultaneously 
improving  scientific  knowledge  and  technical/analytical  capabilities. 

(b)  INNOVATION 

Multiple  phenotypic  traits  (such  as  height  or  weight)  are  often  treated  as  independent  from  the  effect  under 
study,  but  that  neglects  the  reality  that  many  traits  are  linked  to  other  genetic  and  environmental  modifiers. 
Others  incorporate  and  calculate  variances  based  on  environmental  or  geographic  stratifications.  However,  this 
ignores  synergism  between  the  organism,  its  immediate  surroundings,  and  the  greater  environment.  While  it  is 
not  possible  to  measure  and  analyze  every  part  of  the  environment,  some  baseline  state  must  be  identified 
from  which  deviation  can  be  measured  to  test  a  priori  hypotheses.  In  the  absence  of  this  uniform  baseline, 
almost  all  statistical  measures  will  fail  to  adequately  detect  regions  of  interest.  This  application  will 
demonstrate feasibility and innovation in preliminary studies (c.5) using an entirely new approach
(ANCOVA/IUT) to conducting genome wide association (GWA) genetics based on continuous variable
data. An important challenge to GWA that relates to these issues above is population structure (i.e., correcting
genetic studies for non-disease-associated allele frequencies that vary in human populations). Two common
ways to address this are traditional meta-analytic techniques and IUT. But these approaches are selected
more  out  of  necessity  than  experimental  design  concerns.  The  majority  of  combinatorial  studies  have  focused 
on  publicly  available  datasets.  Each  of  the  individual  datasets  contains  differing  degrees  of  artifactual  bias  and 
other, potentially unrelated, variables. Oncomine’s and other algorithms applying this strategy to gene
expression have had some success, but the strategy has not been the panacea originally prognosticated.

Multivariate  and  integrative  analyses  can  potentially  solve  many  issues  associated  with  genome  wide 
studies. However,  they  are  limited  by  their  ability  to  synthesize  data  into  useful  parcels  of  information  that 
are  applicable  clinically  or  to  research.  Integrative  analysis  has  the  benefit  of  alternative  testing.  While  multiple 
testing  using  the  same  measures  and  techniques  increases  error  rates  [^^],  alternative  testing  allows 
measurement  of  the  same  effect  using  different  types  of  measures.  As  these  are  subjected  to  different  analytic 
techniques,  the  posterior  probability  of  false  positives  is  reduced.  Even  with  this  strength,  it  is  limited  by  biases 
and  assumptions  associated  with  individual  measures.  Ultimately  the  question  of  how  to  appropriately  identify 
genetic  contributions  independent  of  latent  confounds  has  not  been  conclusively  answered.  The  gold  standard 
for  analyses  is  univariate  testing.  While  geneticists  talk  about  penetrance  in  relation  to  populations  and 
percentages,  the  statistical  actuality  is  that  penetrance  describes  odds  ratios.  Establishing  causation  and 
deviation  from  population  norms  using  case-control,  linkage,  or  association  analyses  requires  certain 
assumptions  to  be  accepted  that  biologically  may  or  may  not  be  perilous  to  the  analysis.  While  this  is  important 
to  ethologists  and  population  geneticists,  attempting  to  compensate/account  for  these  phenomena  hinders  and 
complicates analyses. We are interested in identifying biological outcomes that are well described and are
not concerned with tangential characteristics of the effect. To this end, we sought to isolate rather than
compensate  for  effects.  When  examining  multidimensional  data  it  is  easy  to  disregard  the  interaction  of 
dimensions.  Most  dimensional  reduction  techniques  measure  and  condense  data  so  that  interdimensional 
effects  can  be  quantified.  Priming  effects  can  drastically  alter  these  techniques  and  limit  their  usefulness.  For 
this  reason  we  applied  ANCOVA  [®]  to  remove  independent  effects  from  dependent  effects  prior  to  dimensional 
reduction.  Here  we  show  adjusted  and  un-adjusted  measures  to  illustrate  how  the  application  of  ANCOVA  prior 
to  traditional  techniques  is  capable  of  increasing  the  sensitivity  of  a  study,  as  well  as  the  potential  to  correct  for 
the  reversal  paradoxes  (c.5.  P.S.,  Study  Design)  by  comparison  to  traditional  normalization  techniques. 

(c)  APPROACH 

C.1.  Research  team.  The  multidisciplinary  team  is  ideally  suited  for  this  project.  Dr.  Alvarez  (PI)  is  PI  in 
Molecular  and  Human  Genetics,  Nationwide  Children’s  Hospital  Research  Institute,  with  a  tenure  track 
academic  appointment  at  The  Ohio  State  University  College  of  Medicine.  He  has  extensive  expertise  in 
molecular  and  human  genetics  and  genomics,  bioinformatics,  and,  from  management  level  industry  experience 
(Novartis  Research),  the  discovery  and  validation  of  new  drug  targets  and  biomarkers.  Dr.  Leszek  Rybaczyk 
(Research  Scientist,  Alvarez  Lab)  is  expert  in  statistical  bioinformatics.  Dr.  Huang  Kun  (Co-I)  is  co-director  of 
the  OSU-CCC  Biomedical  Informatics  Shared  Resource.  His  research  is  focused  on  developing  bioinformatics 
tools  for  systems  biology  and  research.  Here  he  will  be  responsible  for  developing  and  implementing  the 
software package. The advanced statistics expertise will come from a long term collaborator of the three
investigators named above, Dr. Pramod K. Pathak (consultant, MSU). He is a theoretical and applied
statistician  with  specific  interests  in  statistical  methods  and  their  applications  to  biomedical  research,  sampling 
and  resampling  methods,  computational  statistics,  reliability,  and  optimization  problems  in  statistics. 

C.2.  Research  strategy  (RS).  Note:  As  the  approach  has  statistical  components  addressing  different  biology, 
we  will  explain  the  approach  once,  in  Research  Strategy,  and  establish  feasibility  in  Preliminary  Studies. 

RS  Aim  1.  We  propose  to  address  these  gaps  by  applying  statistically  proven  methodologies  in  novel  ways. 
ANCOVA has been applied in various fields such as psychology and epidemiology to correct for
baseline variables. Despite the similarities in variable types, data structure, and problems with confounds,
ANCOVA has never been applied to large scale genetic datasets. Aim 1: Demonstrate on a large dataset the
ability  of  ANCOVA  to  correctly  identify  biologically  relevant  phenomena  that  are  linked  to  a  disease  trait.  The 
rationale  and  technical  approach  for  this  aim  are  well  elaborated  in  c.5.  Preliminary  Studies.  Canine  genetic 
data similar to those generated in Preliminary studies will be generated from 1) 36 Scottish Deerhounds: 18
osteosarcoma cases and 18 controls (i.e., three case-control batches of six and six), as well as 2) 36
Doberman Pinschers: 18 with cervical spondylomyelopathy and 18 controls (i.e., three case-control batches of six and
six). In addition, we will analyze diverse genomic datasets from the public domain (including human SNP
GWA,  gene  expression,  and  HT-sequencing).  For  example,  by  using  TCGA  data,  in  which  the  same  patient’s 
tissue  was  assayed  on  different  microarrays  in  different  laboratories,  using  an  ANCOVA  approach  we  will 
identify  the  most  biologically  relevant  factors.  We  will  expand  that  by  looking  not  only  at  the  cancer  type,  but 
also  at  the  laboratory  where  the  tissue  was  processed;  the  date  on  which  it  was  processed,  etc.,  and 
identify/potentially remove such intrinsic errors. Power analysis. Based on our ongoing genetic studies (see
Preliminary Studies), we assumed that potentially relevant SNPs will reduce the total of 173,000 SNPs to 1700
[MD Anderson Bioinformatics server with power of 0.8, acceptable false positives of 1, SD of 0.7]. With the
sample size of 36 dogs in each breed (18 cases and 18 controls), we will have 80% power to detect 2-fold differences
in B allele frequency between cases and controls for candidate SNPs of interest (per SNP alpha = 0.00059).
This is conservative, as ANCOVA and IUT would only reduce the variance.
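
As a check on the arithmetic above, the per-SNP alpha of 0.00059 is consistent with allowing one expected false positive among 1700 candidate SNPs. The sketch below assumes that relationship (alpha = acceptable false positives / number of tests), which is our reading of the stated parameters rather than documented output of the power calculator; the function name is illustrative only.

```python
def per_test_alpha(acceptable_false_positives, n_tests):
    """Per-test alpha that yields the given expected number of false positives."""
    return acceptable_false_positives / n_tests

# 1700 candidate SNPs, one acceptable false positive (values from the text).
candidate_alpha = per_test_alpha(1, 1700)

# For comparison: a Bonferroni threshold over the full 173,000-SNP array
# at a family-wise alpha of 0.05.
bonferroni_alpha = 0.05 / 173_000

print(f"per-SNP alpha on 1700 candidates: {candidate_alpha:.5f}")
print(f"Bonferroni alpha on 173,000 SNPs: {bonferroni_alpha:.2e}")
```

The comparison shows how much less stringent the per-SNP threshold becomes once the marker set is reduced to candidates.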

RS Aim 1 Potential pitfalls and contingencies. (1) A limitation to using the integrated ANCOVA/IUT
on biological data is that it is only applicable for continuous variable data. While this excludes, say,
conventional binary-genotype GWA analysis, we address this need with the development of an IUT-alone
approach; this use is now validated by us (see c.2. RS Aim 3 Expected results. Example 1) and by a second
independent group. Moreover, much genetic data (e.g., array CGH, HT-sequencing) and most genomic data
have continuous variables (microarray and HT-sequencing based RNA expression and epigenetics, proteomics,
metabolomics, etc.). (2) Another potential concern is the need for clear understanding of appropriate data
structure. For that reason, we chose to make this proposal not only about the statistical methods, but also
about experimental design. We will make a major effort to document the proper use of these algorithms in
publications and software Help documentation. (3) Lastly, these methods are computationally intensive. This
will not affect us, as Dr. Huang (Co-I) is Director of Bioinformatics and has access to the OSU Supercomputer
Center.  Despite  the  computational  demands,  the  methods  proposed  here  offer  analytical  abilities  that  are 
unique  and  state  of  the  art,  and  are  sure  to  gain  wide  use.  We  believe  that  our  optimization  studies  and  careful 
statistical/software  instructions  will  facilitate  the  most  efficient  implementation  of  our  algorithms. 

RS Aim 2. A second statistical technique, the Intersection Union Test, has been gaining use in the
genomics field. The IUT increases power, but also increases type I error as the number of comparisons
increases. However, because of the many latent confounds that cannot be accounted for in most genomic
work, the IUT is the most elegant solution to reducing these errors. For instance, in large datasets where a
multitude of tests are conducted under traditional techniques, a multi-testing correction would need to be
applied. However, as we previously demonstrated using the IUT, the probability of any specific false positive
decreases exponentially with the addition of new datasets. This is because the probability of detecting the
same false positive in k independent datasets is the product of the per-dataset α (traditionally 0.05), i.e., α^k. For two datasets the
probability of the same false positive being detected is 0.0025, for three it is 0.000125, and so on. This can
compensate for even large datasets. In datasets with 173,000 variables (SNP arrays used in preliminary
studies), using between 4 and 6 independent datasets would reduce the expected number of false positives to
about one or fewer (173,000 × 0.05^4 ≈ 1.1; 173,000 × 0.05^6 ≈ 0.003). Conversely, if the
same signal is being detected in 6 datasets, the probability that it is due to chance is of the order of 1.6 × 10^-8 (0.05^6). Aim
2: IUT is a powerful new tool for genetic analysis and allows for multiple corrections without manipulation of
individual datasets. We propose to demonstrate the ability of IUT to detect complex genetics in a disease
phenotype and how combining IUT with ANCOVA will allow the detection of genetic determinants and potentially
explain penetrance. The non-obvious advancement of this method is that it incorporates information theory by
minimally altering the data before analyzing it. This retains the maximum amount of information for each
measure. The IUT is also not hampered by many of the assumptions of other tests.
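
The false-positive arithmetic in this paragraph can be reproduced with a short sketch. It assumes, as the text does, that the datasets are independent and that each is tested at the same per-dataset alpha; the function names are illustrative only.

```python
def joint_false_positive_rate(alpha, k):
    """Probability that one specific null marker passes in all k independent datasets."""
    return alpha ** k

def expected_false_positives(n_markers, alpha, k):
    """Expected count of null markers passing in all k datasets."""
    return n_markers * joint_false_positive_rate(alpha, k)

ALPHA = 0.05          # per-dataset significance level used in the text
N_MARKERS = 173_000   # SNP array size used in Preliminary Studies

for k in range(1, 7):
    rate = joint_false_positive_rate(ALPHA, k)
    count = expected_false_positives(N_MARKERS, ALPHA, k)
    print(f"k={k}: joint rate={rate:.3e}, expected false positives={count:.4f}")
```

Under these assumptions the expected number of false positives across the array drops to roughly one at four datasets and well below one at five or six.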

RS Aim 2 Potential pitfalls and contingencies. The IUT is dependent on having a common variable
across all data sets used in the analysis. This variable can be very broad such as dog breed or very narrow
such as a molecular phenotype. Regardless, the IUT will only answer questions related to the common
variable among data sets. One way to correct for that is in the initial study design. The study design should
take into account all of the limitations associated with the various statistical tests a priori. As we recently
discussed in a publication, applying the IUT to unrelated data sets will result in the elimination of all signal.

RS  Aim  3  rationale.  Large  scale  studies  that  use  traditional  GWA  require  large  patient  populations  to 
achieve  adequate  power  (and  have  yet  to  explain  a  significant  portion  of  the  heritability  associated  with  most 
diseases). This has serious pragmatic and ethical implications.^^ It also poses several experimental design
problems, as independent irrelevant variables (e.g., in genetics, population structure) can overpower the effect
of interest.^® Manipulation of data by Principal Component Analysis (PCA) after merging, or applying
normalizations, hinges on the assumption that the interactions are linear. If the interactions are non-linear,
applying these corrections can make analysis more difficult.^® Aim 3: We propose to demonstrate that
ANCOVA and IUT are more powerful than the traditional techniques by identifying a study and replicating that
study using fewer patients and demonstrating that our technique is able to detect the same signal in addition
to other variants missed by the more traditional techniques.

RS Aim 3 Genetic studies experimental plan. As we did in Preliminary Studies (c.5., using the same
Illumina 173,000 SNP array), we will conduct GWA analysis of two complex traits, each with high incidence in
a dog breed. Mapping (1) As validation of a complex trait that has been mapped using a conventional genetic
approach and published, we will map osteosarcoma in Scottish Deerhounds (one locus of dominant effect with
evidence of linkage (Zmax=5.766)).®° The original work used a 4-generation pedigree where 60 Deerhounds
were genotyped and the genotypes of 70 others were inferred, for a total of 130 dogs. We will replicate that
study using the methods developed in this proposal to conduct GWA (ANCOVA/IUT on B allele frequency data
and IUT on allele/genotype data) on 18 Deerhound cases and 18 controls (i.e., three case-control batches of
six and six). Mapping (2) In order to immediately draw high impact attention to our innovative approaches, we
propose to conduct GWA of a prominent breed-specific complex-genetic condition with high human relevance
- “wobblers” or cervical spondylomyelopathy in Doberman Pinschers (reported to explain 2.5% of proportional
mortality in the breed). We have been collaborating for over a year with Ronaldo da Costa, our OSU
colleague who is a leading authority in this.^^ We are currently conducting pedigree analysis on ~1000
Dobermans (showing strong evidence of heritability; data not shown), and have initiated collection of
blood/DNA samples. Using the Doberman wobblers pedigree, we will select optimal informative dogs to
conduct a mapping study with 18 cases and 18 controls (i.e., three case-control batches of six and six). Power
analysis. See c.2. RS Aim 1, end of first paragraph. Follow up to broad mapping: depending on the
type/strength  of  the  evidence  and  the  length  of  the  haplotypes,  we  will  conduct  either  fine  mapping  in  related 
breeds  that  share  a  similar  phenotype,  sequence  implicated  haplotypes  using  sequence  capture,  or 
characterize  transposition  events,  structural  variation  or  DNA  methylation  status  (see  PI  (Alvarez)  biosketch, 
which  demonstrates  successful  funding  of  grants  in  this  area  from  NIH,  DoD  CDMRP  and  AKC-CHF).  The  PI  is 
expert  in  genomics  and  sequence  and  evolutionary  biology  analyses  that  will  be  required  to  fully  evaluate 
genetic  variants  and  their  possible  disease  effects. 

RS  Aim  3  Expected  results.  We  predict  that  in  Mapping  (1)  we  will  identify  the  same  locus  published 
previously  (leading  to  refining  the  locus  through  recombination  in  both  breeds),  and  that  we  will  identify  other 
loci associated with osteosarcoma risk - both SNP alleles and B allele frequency changes suggestive of CNV
or of effects resulting in allele-specific SNP genotyping bias from the amplification step [^®]. As Deerhounds are
relatively  closely  related  to  Greyhounds,  we  also  expect  to  find  some  loci  shared  between  the  two,  which  would 
provide  convincing  replication  of  the  findings  in  our  preliminary  studies.  We  predict  that  in  Mapping  (2)  we  will 
find  wobblers-associated  variants.  For  both  mapping  studies  we  expect  to  identify  loci  that  could  not  have  been 
found using conventional genetic analyses. Example 1: in preliminary GWA studies applying IUT to binary
genotype calling of the same Illumina SNP array data used in c.5. Preliminary Studies, we identified a genome
wide significant locus that would not have been identified by conventional Chi-Square GWA analysis (not
shown). Strikingly, two of the three case-control groups had increased frequency of the SNP allele associated
with high risk, but the third group had reduced frequency of the same allele associated with reduced risk. We
propose that, due to reversal paradox effects, many such findings cannot be detected by conventional
GWA. We also expect to identify candidate genes (e.g., some osteosarcoma candidate haplotypes have no
more than one gene) and variants (e.g., through sequence capture) within association loci. Example 2: in
Preliminary Studies we demonstrate the use of ANCOVA/IUT to identify continuous variable differences in B
allele frequencies associated with osteosarcoma risk. This would not be possible with current approaches that
map binary SNP alleles (and cannot be detected indirectly by tag-SNPs in LD when the variants are relatively
recent). Such variation may be indicative of genetic effects never before sampled genome wide for GWA, such
as CNV or isothermal amplification bias in Illumina Infinium SNP genotyping (e.g., due to DNA methylation,
structural variation, and retrotransposition events). If our expected results materialize, as is strongly supported
by  our  preliminary  studies,  they  would  establish  the  superior  power  and  preservation  of  information  in  the 
innovative  experimental  design  and  analyses  we  propose;  and  it  would  open  the  door  to  studying  the  most 
common  (and  with  highest  mutation  rates)  types  of  genetic  variation  [^®]  for  the  first  time. 

RS Aim 3 Potential pitfalls and contingencies. Our preliminary studies support the feasibility of
applying very well-established statistical methods for novel biological data analyses. For example, applying an
IUT approach to GWA using binary genotype data identified a SNP locus at genome wide significance, but no
locus reached significance using conventional Chi-Square analysis on the same genotype data (see Example
1 in previous section). Notably, others have recently independently validated that same application of IUT.^^ A
second example is the fact that the ANCOVA/IUT mapping approach identified several loci that were covered
by multiple significant SNPs, including five SNPs in a 600 kb region of chr6; the odds of the observed
physical genome distribution being a random effect are infinitesimally low. The greatest challenges in the field
of GWA are validation of association and identification of causative mutations. These remain potential pitfalls
for us, but we are encouraged by the fact that our osteosarcoma GWA (using IUT of conventional binary
genotypes) in Greyhounds identified one SNP (of 19 significant) within the 4.5 Mb interval identified for
linkage to osteosarcoma in the closely related Scottish Deerhound. This ability to fine map across related
breeds is one of the major strengths of dogs, as are the reduced phenotypic and genetic heterogeneity. For
the mutation detection, we will be challenged as is everyone, but 1) we have improved chances over most
others  because  we  will  have  more  loci  to  prioritize  for  specific  molecular  approaches  based  on  our  types  of 
findings  (say,  structural  variation  vs.  DNA  methylation),  and  2)  we  have  the  technical  and  computational 
expertise,  and  are  using  the  most  cutting  edge  methodologies. 

C.4.  Software  development 

All the algorithms developed in this project will be integrated into an open source R package using R and
Bioconductor functions and packages. The package will be tested on both a stand-alone workstation and
parallel computing environments, including two clusters available at OSU (one in the Ohio Supercomputer
Center, one in the Dept. of Biomedical Informatics). The package will be released on a project website and
made freely available to the public. In addition, we will submit it to Bioconductor in compliance with the testing and
inclusion criteria. If time permits, we will also consider integrating the R package into a web tool using web
interface tools such as the Rcgi package (a CGI WWW interface for R).

C.5.  Preliminary  studies  &  Demonstration  of  proposed  experimental  approach 

Note: To demonstrate the novelty and significance, and the experimental plan for all three Aims, we devote
significant space in this proposal to describe our preliminary studies (two manuscripts in preparation).

Study design (ANCOVA/IUT approach), canine osteosarcoma (OSA). Dog breeds have ~100-fold
less genetic variation than humans. Greyhounds were split over one hundred years ago into racing and show
sub-breeds (registered NGA and AKC, respectively). Strikingly, racers have the highest OSA rate (25%
incidence) of any breed, whereas show dogs have no increased risk.'^^''^^ We thus designed a study of a
complex genetic trait in an outbred mammal, but used one of the simplest such contexts possible. Genotyping
of these dogs was performed using the highest density SNP array available in dogs (Illumina HD, 173,000
feature; fewer SNPs than humans due to the highly extended linkage disequilibrium (LD) in dogs).
Importantly, this genotyping platform provides not only the presence or absence of the binary A or B alleles at
each marker, but also the signal intensity of the marker and the ratio of the two alleles (referred to as B allele
frequency, BAF). We conducted the SNP genotyping in three OSA positive-negative (case-control) groups in
order to 1) use ANCOVA to adjust for group membership and potentially address the three
reversal paradoxes (Yule-Simpson, Lord's, and suppression), which share the characteristic that the
association between two variables can be reversed, diminished, or enhanced when another variable is
statistically controlled for, and 2) enable the use of IUT in place of GWA by Chi-Square analysis with
Bonferroni multiple testing correction. Specifically, we genotyped batches of 12 dogs in the combination of 4
OSA racers, 4 OSA free racers (OFR) and 4 show
(AKC). Statistics & Results: Data were analyzed using Illumina GS and Partek GS. Sample attributes (incl.
racing/show and disease status) were used to assign animals to conditions for ANCOVA corrections. ANCOVA
is based on regressions and when used as a statistical test assumes that covariates are independent
variables. In our ANCOVA procedure we used it to establish weighted averages so that groups that are
biologically similar have the same regression slope. Linear models in biological contexts have been heavily
criticized. In this procedure a linear model is entirely appropriate since we are classifying based on known
biological traits. Although this does render the measures arbitrary, it allows for effects to be isolated that
can be subjected post hoc to other tests. Figure 1 demonstrates the effects of ANCOVA isolation on principal
components associated with the phenotype of interest. Before correction, two low risk groups (AKC
and OFR) fail to cluster according to risk due to population structure.

Fig 1. Application of ANCOVA. Correction of Greyhound osteosarcoma (OSA) positive and negative
continuous variable genotypes (B allele frequencies). (A) Uncorrected analysis shows population structure
effects: separating OSA positive and negative racers apart from negative AKC show Greyhounds.
(B) ANCOVA-corrected analysis cleanly separates OSA positive and negative dogs.

Table 1. Analysis of informative SNPs using ...
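
The covariate correction described in this section can be illustrated with a minimal sketch. It is a generic linear-model adjustment, not the weighted procedure we ran in Partek GS: it assumes a single SNP's B allele frequency, a binary disease status, and a binary population covariate (racer = 0, show = 1), and all data values and the function name are hypothetical.

```python
import numpy as np

def adjust_for_covariate(y, status, covariate):
    """Fit y ~ intercept + status + covariate by least squares and
    subtract the fitted covariate effect, keeping the status effect."""
    y = np.asarray(y, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    design = np.column_stack([np.ones_like(y), status, covariate]).astype(float)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    adjusted = y - coef[2] * covariate
    return adjusted, coef

# Hypothetical BAF values for one SNP in six dogs:
# two OSA racers, two OSA-free racers, two show dogs (all OSA-free).
status     = [1, 1, 0, 0, 0, 0]          # 1 = OSA case
population = [0, 0, 0, 0, 1, 1]          # 1 = show (AKC)
baf        = [0.50, 0.50, 0.30, 0.30, 0.55, 0.55]

adjusted, coef = adjust_for_covariate(baf, status, population)
print("coefficients (intercept, status, population):", np.round(coef, 3))
print("adjusted BAF:", np.round(adjusted, 3))
```

Before adjustment the show dogs sit apart from both racer groups; after the population effect is removed, the two low-risk groups share the same value and only the status difference remains.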

Regression  lines  were  computed  for  the  appropriate  factors  and 
interaction  values  were  transformed  and  weighted  to  correct  for  the  slope 
of  the  generalized  linear  model.  We  next  calculated  the  covariance 
matrix of the loading values for each dataset and conducted IUT using a
threshold  of  ±0.6.  Many  publications  have  reported  that  Pearson 
correlation  (r)  values  of  0.4  are  biologically  significant.  Here  we  used  0.6 
assuming  it  most  likely  captures  the  most  informative  SNPs. 
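
The thresholded IUT step can be sketched as follows. This is a simplified illustration, not the analysis we performed: it interprets the ±0.6 threshold as requiring |r| ≥ 0.6 between B allele frequency and case status in every batch (sign agreement is not enforced), works on raw values rather than loading values, and uses small hypothetical arrays. The function names are illustrative only.

```python
import numpy as np

def batch_correlations(baf, status):
    """Pearson r between each SNP's BAF (columns) and case status within one batch."""
    baf = np.asarray(baf, dtype=float)
    status = np.asarray(status, dtype=float)
    baf_c = baf - baf.mean(axis=0)
    status_c = status - status.mean()
    numerator = baf_c.T @ status_c
    denominator = np.sqrt((baf_c ** 2).sum(axis=0) * (status_c ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        return numerator / denominator   # NaN for zero-variance SNPs

def iut_select(batches, threshold=0.6):
    """Keep a SNP only if it passes the threshold in every batch (intersection)."""
    passes = [np.abs(batch_correlations(baf, status)) >= threshold
              for baf, status in batches]
    return np.logical_and.reduce(passes)

status = [1, 1, 0, 0]                      # two cases, two controls per batch
signal = [0.9, 0.8, 0.2, 0.1]              # BAF tracks case status
noise = [0.5, 0.4, 0.5, 0.4]               # BAF unrelated to case status

# Columns: SNP0 (signal in all batches), SNP1 (signal in batch 1 only), SNP2 (noise).
batch1 = np.column_stack([signal, signal, noise])
batch2 = np.column_stack([signal, noise, noise])
batch3 = np.column_stack([signal, noise, noise])

selected = iut_select([(batch1, status), (batch2, status), (batch3, status)])
print(selected)
```

Only the SNP that passes in all three batches survives; a SNP that passes in a single batch is treated as a likely batch-specific artifact and dropped.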

A list of potential candidate SNPs from the ANCOVA/IUT was
identified and used to filter genotype information. Genotypes were subjected to a Chi-Square test of
association for osteosarcoma risk. Non-significant genotypes were eliminated from the analysis. Once only
SNPs that are loaded with the most meaningful measures remained, we conducted t-tests to determine if they
were capable of discriminating between the two training populations. This procedure revealed that the
osteosarcoma free racers and the AKC show Greyhounds, which have below average incidence rate, clustered
together and the first principal component explained the osteosarcoma risk variability initially masked by the
effects of the population difference (Fig. 1B). We then went on to determine whether it was a genotypic effect
such as haplotypes or if some other mechanism was associated with the differential risk in these two
populations. Intriguingly, regions associated with altered risk could not be identified based on haplotypes
alone. However, the signal was derived from alterations in B allele frequency that correctly categorized dogs
across unrelated datasets. The genome wide significant hits are shown in Table 1. Encouragingly, several
regions are detected by multiple SNPs (colored), including five SNPs in a 600 kb region of chromosome 6.

Preliminary studies conclusions. Here we presented the first GWA study of osteosarcoma in any
organism, and reported approximately twenty hits. Our approach showed how population structure can affect
the ability to detect biologically relevant genetic effects. In addition, this is the first work to detect genome wide
significant association signal using continuous variable genotype data (B allele ratios) and ANCOVA/IUT; we
propose those loci are a combination of CNVs and genetic/epigenetic variants with differing amplification bias
in the SNP genotyping protocol. This is consistent with Dr. Nadeau's suggestion that the missing heritability
may lie in unexplored genome regions or "in largely untested classes of genetic variation." Beyond the
analysis shown here, we conducted a second GWA analysis of the same data, but applying only IUT using
binary allele calls - see c.2., RS Aim 3, Expected results and Potential pitfalls and contingencies. That analysis
suggested validation of the study, as one of 19 genome wide significant hits is within the 4.5 Mb interval linked
to osteosarcoma in Deerhounds. Moreover, we identified SNPs that could not be identified by conventional
approaches due to the reversal paradoxes.


ANOVA  for  multiple  categories  of  risk. 
Application summary: We propose to develop novel applications of validated statistical approaches to enable
greatly improved analysis of continuous-variable biological data. This and the new applications of IUT will be
widely used for genomic and integrative analyses.
Abstract

Background 

Mapping  medical  terms  to  standardized  UMLS  concepts  is  a  basic  step  for  leveraging 
biomedical  texts  in  data  management  and  analysis.  However,  available  methods  and 
tools  have  major  limitations  in  handling  queries  over  the  UMLS  Metathesaurus  that 
contain  inaccurate  query  terms,  which  frequently  appear  in  real  world  applications. 

Methods 

To  provide  a  practical  solution  for  this  task,  we  propose  a  layered  dynamic 
programming  mapping  (LDPMap)  approach,  which  can  efficiently  handle  these 
queries.  LDPMap  uses  indexing  and  two  layers  of  dynamic  programming  techniques 
to  efficiently  map  a  biomedical  term  to  a  UMLS  concept. 

Results 

Our  empirical  study  shows  that  LDPMap  achieves  much  faster  query  speeds  than 
LCS.  In  comparison  to  the  UMLS  Metathesaurus  Browser  and  MetaMap,  LDPMap  is 
much  more  effective  in  querying  the  UMLS  Metathesaurus  for  inaccurately  spelled 
medical  terms,  long  medical  terms,  and  medical  terms  with  special  characters. 

Conclusions 

These  results  demonstrate  that  LDPMap  is  an  efficient  and  effective  method  for 
mapping  medical  terms  to  the  UMLS  Metathesaurus. 

Background 

Efficiently processing and managing biomedical text data is one of the major tasks in
many medical informatics applications. Biomedical text analysis tools, such as
MetaMap [1] and cTAKES [2], have been developed to extract and analyze medical
terms from biomedical text. However, medical terms often have multiple names,
which makes the analysis difficult. As an effort to standardize medical terms, the
Unified Medical Language System (UMLS) [3] maintains a very valuable resource
of controlled vocabularies. It contains over 200 million medical terms (also known as
"medical concepts"). Each medical term is identified by a unique id known as a
Concept Unique Identifier (CUI). The UMLS also records relations between medical
terms. As a result, mapping biomedical text data to the UMLS and mining UMLS
associated datasets often yield rich knowledge for many biomedical applications [4]
[5] [6] [7] [8].

In  order  to  effectively  query  or  use  the  UMLS,  one  of  the  fundamental  tasks  is  to 
correctly  map  a  biomedical  term  to  a  UMLS  concept.  Currently,  there  are  a  number  of 
publicly  available  tools  to  achieve  this  goal.  One  notable  approach  is  to  use  the 
official  UMLS  UTS  service  (UMLS  Metathesaurus  Browser)  available  on  the  UMLS 
official  website  (https://uts.nlm.nih.gov).  Users  are  able  to  input  a  medical  term  and 
the system will return a query result. MetaMap [1], which has been developed and
maintained by the US National Library of Medicine, has become a standard tool for
mapping biomedical text to the UMLS Metathesaurus. cTAKES [2] is an open-source
natural language processing system that can process clinical notes and identify
named entities from various dictionaries, including the UMLS.

However, after using these tools in our research, we found that they do
not work well in mapping medical terms that are just slightly different from the terms
in the UMLS. For example, the UMLS Metathesaurus Browser, MetaMap, and
cTAKES fail to process the query term "1-undecene-1-O-beta 2',3',4',6'-tetraacetyl
glucopyranoside" even though it differs by only one character (a missing "-" between
"beta" and "2'") from the official UMLS concept "1-undecene-1-O-beta-2',3',4',6'-tetraacetyl
glucopyranoside". This drawback makes it hard to handle much real-world
data, such as Electronic Health Records, which contain a lot of noisy information,
including missing and incorrect data [9]. In addition, these tools often fail to handle long
medical terms even if those terms are identical to the terms in the UMLS. For
example, the Metathesaurus Browser cannot handle query terms with more than 75
characters, and sometimes cannot even accurately answer a query term that exactly
matches a concept name in the UMLS (see discussions in the result section).
MetaMap and cTAKES, on the other hand, often break down a long medical term
into several shorter terms. For example, if we query MetaMap with the clinical drug
"POMEGRANATE FRUIT EXTRACT 150 MG Oral Capsule", we get several
UMLS concepts such as "C1509685 POMEGRANATE FRUIT EXTRACT",
"C2346927 Mg++", and "C0442027 Oral", instead of this drug concept, which has a
unique CUI C3267394 in the UMLS. The situation becomes even worse when
medical terms contain special characters, i.e., characters other than numbers or letters.
For example, MetaMap completely fails to find any
relevant CUI for the medical concept "cyclo(Glu(OBz)-Sar-Gly-(N-cyclohexyl)Gly)2".
These drawbacks are very undesirable when handling biomedical texts. By studying
the UMLS Metathesaurus, we found that a significant number of medical terms are
quite long. About 10.7% of UMLS concepts contain at least 75 characters (including
white spaces), and about 50.9% of UMLS concepts contain at least 32 characters. In
addition, a large number of medical terms contain special characters. More than
61.3% of UMLS concepts contain at least one special character and about 11% of
UMLS concepts contain at least 5 special characters. In fact, we found that many special
characters are optional in a medical term. For example, the term "Cyclic AMP-Responsive
DNA-Binding Protein" and the term "Cyclic AMP Responsive DNA Binding
Protein" both refer to the same concept "C0056695" in the UMLS Metathesaurus,
though the latter is missing two "-" characters. The UMLS handles a medical term with different
names by including multiple common names in the Metathesaurus. Given the fact that
in many cases special characters are optional, it is practically impossible to let the
Metathesaurus contain all possible names. Consider a UMLS concept with 20
special characters: if each special character may be replaced by a white space, then
there are approximately 1 million (2^20 = 1,048,576) aliases for this concept alone, not to mention that
more than 0.3% of UMLS concepts contain 20 special characters or more.

This problem is in fact related to the classical spelling correction problem, in which a
misspelled word is corrected to the most closely matched word. The classic
measurement of dissimilarity between two words is based on one of several distance functions,
such as edit distance [10], Hamming distance [11], and longest common subsequence
distance [12] [13]. Thus spelling correction is essentially finding a valid word with
the minimum distance to the misspelled word. Quite a few dynamic programming
algorithms have been proposed to solve this problem. Readers can find a survey of
these algorithms in [14]. In recent years, spelling correction has evolved to perform
query corrections. This correction is often a task of context sensitive spelling
correction (CSSC), where corrections are geared towards more meaningful or
frequently searched words [15]. Thus, it is a good idea to use the query log to assist
the correction [16].

Unlike in many query applications, here it is not sufficient to return a frequently searched
medical term that best matches the query based on search history, not to mention that
such history data is often not available. Accurately identifying a specific biomedical
term, such as a drug name or a chemical compound, is demanded by many biomedical
applications. Given this consideration, classical spelling correction techniques are
preferable to CSSC for matching biomedical terms to UMLS concepts.
However, we found that the classical dynamic programming algorithm is too slow for
this task because of the huge volume of terms in the UMLS Metathesaurus. In
addition, it is unable to effectively handle a term with missing words (e.g., "gastro
reflux" has a large distance to "gastro oesophageal reflux" though the two terms
usually mean the same thing), or words not in their usual order (e.g., "lymphocytic
leukemia chronic" has a large distance to "leukemia chronic lymphocytic").

The  background  described  above  motivated  us  to  find  an  efficient  and  accurate 
medical  term  mapping  method  for  the  UMLS.  To  tackle  this  challenge,  in  this  work 
we  propose  a  Layered  Dynamic  Programming  Mapping  (LDPMap)  approach  to  query 
the  UMLS  Metathesaurus. 

Methods 

We use Longest Common Subsequence (LCS) to measure the similarity between two
words. Given two words A and B, their similarity is defined as:

$$\mathrm{WordSimilarity}(A, B) = \frac{2\,|\mathrm{LCS}(A, B)|}{|A| + |B|}$$

This similarity measure is a variation of the longest common subsequence distance
[12]. We can observe that WordSimilarity(A, B) ranges between 0 and 1. In addition,
WordSimilarity(A, B) = 1 if and only if A and B are identical, and WordSimilarity(A,
B) = 0 if and only if A and B share no common letters.
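As a minimal sketch of this definition (not the C++ implementation used in this work), word similarity can be computed with the standard row-by-row LCS dynamic program. The function names `lcs_length` and `word_similarity` are illustrative, and the handling of two empty words is an assumption that the definition above does not cover.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)          # DP row for the empty prefix of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """WordSimilarity(A, B) = 2 * |LCS(A, B)| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumption: two empty words are treated as identical
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

For instance, "gastro" and "gastric" share the subsequence "gastr" of length 5, so their similarity is 10/13, while two words with no letter in common score 0.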

The function WordSimilarity(A, B) is the basic building block for LDPMap. In the
UMLS, each concept is a sequence of words. We define the similarity between two
concepts $\alpha_n = (A_1, A_2, \ldots, A_n)$ and $\beta_m = (B_1, B_2, \ldots, B_m)$ as:

$$\mathrm{ConceptSimilarity}(\alpha_n, \beta_m) = \max_{R} \sum_{(i,j) \in R} \mathrm{WordSimilarity}(A_i, B_j)$$

Similar to word similarity, in our query we normalize the concept similarity by
the number of words contained in each concept. We can observe that the normalized
concept similarity score ranges between 0 and 1. If two concepts are identical then
this score is 1.

$$\mathrm{NormConceptSimilarity}(\alpha_n, \beta_m) = \frac{2\,\mathrm{ConceptSimilarity}(\alpha_n, \beta_m)}{|\alpha_n| + |\beta_m|}$$

The key issue in the above definition is R, which is a matching relation between
words in $\alpha$ and $\beta$. We consider two constraints on R, which lead to two different foci.

Constraint 1: There do not exist two distinct matching pairs (i, j), (x, y) in R such that i = x or
j = y.

Constraint 2: In addition to Constraint 1, for any two matching pairs (i, j), (x, y) in R,
either i < x and j < y, or x < i and y < j.

Constraint 1 converts the concept similarity problem into a maximum weighted
bipartite matching problem [17]. Considering a bipartite graph built on two vertex sets
$\alpha_n$ and $\beta_m$ with word similarities being the edge weights, finding the highest score for
concept similarity under Constraint 1 is equivalent to finding a maximum weighted
matching for the bipartite graph. This model is particularly helpful for identifying the
similarity between two terms regardless of their word ordering. We used this as one of
the measurements in our final query workflow (Figure 1) and implemented it by
maximum weighted matching.
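To make the Constraint 1 formulation concrete, the following is a brute-force sketch that enumerates one-to-one word assignments for short terms. It is an illustration under stated assumptions rather than the matching algorithm used in this work: the function name and the example weights are hypothetical, and a practical implementation would use a polynomial-time assignment solver (for example, SciPy's `linear_sum_assignment`) instead of enumeration.

```python
from itertools import permutations


def unordered_concept_similarity(weights):
    """Best total weight of a one-to-one word matching (Constraint 1).

    weights[i][j] is the word similarity between word i of the first term
    and word j of the second term. Brute force: only for short terms.
    """
    n = len(weights)
    m = len(weights[0]) if n else 0
    if n == 0 or m == 0:
        return 0.0
    # Put the shorter term on the rows. Because weights are non-negative,
    # an optimal matching can assign every word of the shorter term.
    if n > m:
        weights = [list(column) for column in zip(*weights)]
        n, m = m, n
    best = 0.0
    for columns in permutations(range(m), n):
        total = sum(weights[i][columns[i]] for i in range(n))
        best = max(best, total)
    return best


# Hypothetical weights for "lymphocytic leukemia chronic" (rows) against
# "leukemia chronic lymphocytic" (columns): identical words score 1.0.
example = [
    [0.2, 0.1, 1.0],
    [1.0, 0.3, 0.2],
    [0.3, 1.0, 0.2],
]
score = unordered_concept_similarity(example)
normalized = 2 * score / (3 + 3)
```

With these weights every word finds its identical partner despite the different word order, so the normalized score reaches 1.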

In the following section, we will focus on the concept similarity calculation under
Constraint 2, which requires that the similarity comparison between two terms
follow the word order in those terms, similar to the LCS problem, in which matching
between two words follows the character order. Thus, the concept similarity
calculation problem can be considered a macro-level similarity calculation where
each unit is a word instead of a letter, as in the case of word similarity calculation.
This model has a lot of advantages, as we will see in the following section.

Optimal Substructure of the Concept Similarity under Constraint 2


Our next question is how to perform the concept similarity calculation. Unlike word
similarity calculation, in which each match outcome is a binary result (i.e., the same
letter or a different letter), each match in the concept similarity calculation is a word
similarity value between 0 and 1. The algorithm for the word similarity calculation
therefore cannot be applied directly to the concept similarity calculation. However, we find
that the concept similarity calculation also has an optimal substructure, as follows:

$$
\mathrm{ConceptSimilarity}(\alpha_i, \beta_j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
\max\bigl(\mathrm{ConceptSimilarity}(\alpha_{i-1}, \beta_{j-1}) + \mathrm{WordSimilarity}(A_i, B_j),\ \mathrm{ConceptSimilarity}(\alpha_i, \beta_{j-1}),\ \mathrm{ConceptSimilarity}(\alpha_{i-1}, \beta_j)\bigr) & \text{otherwise}
\end{cases}
$$

The above optimal substructure holds because for any two words $A_i \in \alpha_i$, $B_j \in \beta_j$, there
are at most three possible cases:

(1) $(i, j) \in R$, i.e., both $A_i$ and $B_j$ are used in the matching. Then
$\mathrm{ConceptSimilarity}(\alpha_i, \beta_j) = \mathrm{ConceptSimilarity}(\alpha_{i-1}, \beta_{j-1}) + \mathrm{WordSimilarity}(A_i, B_j)$;

(2) $B_j$ is not used in the matching. Then
$\mathrm{ConceptSimilarity}(\alpha_i, \beta_j) = \mathrm{ConceptSimilarity}(\alpha_i, \beta_{j-1})$;

(3) $A_i$ is not used in the matching. Then
$\mathrm{ConceptSimilarity}(\alpha_i, \beta_j) = \mathrm{ConceptSimilarity}(\alpha_{i-1}, \beta_j)$.

Note that we do not consider it a valid case that neither $A_i$ nor $B_j$ is used in the
matching. In this case, we can always choose to match them without
violating Constraint 1, which results in a higher or at least equal concept similarity score.
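The recurrence can be sketched as a small table-filling routine. This is an illustrative version that takes a precomputed matrix of word similarities; the function name and the example weights are hypothetical and are not taken from the paper's tables.

```python
def ordered_concept_similarity(weights):
    """ConceptSimilarity under Constraint 2 (matches must preserve word order).

    weights[i][j] is the word similarity between word i+1 of the first term
    and word j+1 of the second term.
    """
    n = len(weights)
    m = len(weights[0]) if n else 0
    # score[i][j] holds ConceptSimilarity(alpha_i, beta_j);
    # row 0 and column 0 are the base case (value 0).
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + weights[i - 1][j - 1],  # case (1)
                score[i][j - 1],                              # case (2)
                score[i - 1][j],                              # case (3)
            )
    return score[n][m]


# Hypothetical weights for "gastro reflux" (rows) against
# "gastro oesophageal reflux" (columns).
missing_word = [
    [1.0, 0.25, 0.2],
    [0.2, 0.3, 1.0],
]
raw = ordered_concept_similarity(missing_word)
normalized = 2 * raw / (2 + 3)
```

In this example the missing word "oesophageal" is simply skipped, so both query words still match their identical counterparts and the normalized score is 0.8. When the words of two terms appear in different orders, the ordered score is lower than the unordered one, because crossing matches are not allowed.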


Main  Algorithms 

Given the optimal substructure, we can design a dynamic programming algorithm
to calculate the concept similarity score between two terms, on top of the LCS
dynamic programming algorithm for calculating word similarity. The two layers of
dynamic programming not only result in a method less affected by missing words or
words in different orders, but also significantly increase the query speed, as we will
see below. Together, these make our search method practically applicable to many
biomedical applications.

The UMLS Metathesaurus (version used in this work: 2012AB) contains around 11
million records in its MRCONSO.RRF files. Each record is a medical term. For query
purposes, we discard duplicate terms and non-English terms, which leaves about 6.87
million records. A term is considered a duplicate if both its CUI and name are identical
to those of another term. However, among these 6.87 million records, there are only 1,874,573
unique words (white space is the delimiter). Thus concept similarity on a word basis
saves a huge amount of redundant calculation otherwise needed by classic methods on
a character basis. Correspondingly, in our method, we first pre-process the UMLS
Metathesaurus into a word vector of unique words, and convert each UMLS concept,
which consists of a list of words, into a list of indices with regard to the word vector.
Procedure LDPMap-Preprocessing is the pseudo code.


Procedure LDPMap-Preprocessing ()

for i = 1 : length(Metathesaurus)
    Word_Vector = Word_Vector ∪ Metathesaurus[i];
endfor
for i = 1 : length(Metathesaurus)
    for j = 1 : length(Metathesaurus[i])
        WordIndex_vector[i, j] = the index of Metathesaurus[i, j] in Word_Vector;
    endfor
endfor
return Word_Vector, WordIndex_vector;
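A compact sketch of the preprocessing step is shown below. It is an illustration only: the function name is hypothetical, indices are 0-based rather than 1-based, the two loops of the pseudo code are merged into a single pass with a dictionary lookup, and the input is assumed to be a plain list of concept names rather than parsed MRCONSO.RRF records.

```python
def preprocess(concept_names):
    """Build the unique-word vector and the per-concept word-index lists."""
    word_to_index = {}       # word -> position in word_vector
    word_vector = []         # unique words, in order of first appearance
    word_index_vector = []   # one list of word indices per concept
    for name in concept_names:
        indices = []
        for word in name.split():           # white space is the delimiter
            if word not in word_to_index:
                word_to_index[word] = len(word_vector)
                word_vector.append(word)
            indices.append(word_to_index[word])
        word_index_vector.append(indices)
    return word_vector, word_index_vector


words, index_lists = preprocess(
    ["gastro oesophageal reflux", "gastro reflux", "reflux"]
)
```

Here the three concepts share a vocabulary of only three words, which is the redundancy that the word-level index is designed to exploit.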


We process a query using the Algorithm LDPMap Query. When a query process
starts, we first build a word similarity matrix between the query term and the word
vector (Lines 1-5), using the WordSimilarity function defined above. Then we build a
concept score vector between the query term and the 6.87 million UMLS Metathesaurus
concepts (Lines 6-8). The construction of the concept score vector uses the
WordSimilarityMatrix built previously, so no further word similarity
calculations are needed. In addition, it adopts a dynamic programming approach in Function
ConceptSimilarityScore, owing to the optimal substructure of the ConceptSimilarity
function.


Algorithm LDPMap Query (query_term)

1: for i = 1 : length(query_term)
2:     for j = 1 : length(Word_Vector)
3:         WordSimilarityMatrix[i, j] = WordSimilarity(query_term[i], Word_Vector[j]);
4:     endfor
5: endfor
6: for i = 1 : length(Metathesaurus)
7:     ConceptScore_Vector[i] = ConceptSimilarityScore(WordIndex_vector[i]);
8: endfor
9: return Concepts in Metathesaurus corresponding to top scores in ConceptScore_Vector;


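Function ConceptSimilarityScore (WordIndex)

1: for i = 2 : x+1
2:     for j = 2 : y+1
3:         S(i, j) = WordSimilarityMatrix[i-1, WordIndex[j-1]];
4:         if S(i, j) + S(i-1, j-1) > max(S(i-1, j), S(i, j-1))
5:             S(i, j) = S(i, j) + S(i-1, j-1);
6:         elseif S(i-1, j) > S(i, j-1)
7:             S(i, j) = S(i-1, j);
8:         else
9:             S(i, j) = S(i, j-1);
10:        endif
11:    endfor
12: endfor
13: return 2*S(x+1, y+1) / (x+y);

Here x and y denote the number of words in the query term and in the concept,
respectively, and the first row and first column of S are 0, matching the base case
of the recurrence.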


A  Running  Example 

To facilitate the understanding of our method, we provide a simple running example
in Tables 1 and 2. Assume the input query term is "gastro reflux". The
Algorithm LDPMap Query will first build a WordSimilarityMatrix between this
query term and the word vector of the Metathesaurus. Partial results are shown in
Table 1.

After the WordSimilarityMatrix is available, the Algorithm LDPMap Query will
calculate the concept similarity scores between the query term and UMLS concepts by
dynamic programming. The calculation refers to the WordSimilarityMatrix for each word
similarity score instead of calculating it again. An example of a concept similarity
calculation is given in Table 2.
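The running example can also be traced end to end with a small script. This is a sketch under stated assumptions, not a reproduction of Tables 1 and 2: the three-concept "Metathesaurus" is a toy list invented for illustration, all function names are hypothetical, and indices are 0-based.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two words."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def word_similarity(a, b):
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def concept_score(sim, word_index):
    """Normalized concept similarity under Constraint 2, using lookups only."""
    x, y = len(sim), len(word_index)
    s = [[0.0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s[i][j] = max(s[i - 1][j - 1] + sim[i - 1][word_index[j - 1]],
                          s[i - 1][j], s[i][j - 1])
    return 2 * s[x][y] / (x + y)


def ldpmap_query(query_term, concept_names, top_k=3):
    # Preprocessing: unique word vector and per-concept index lists.
    vocab = sorted({w for name in concept_names for w in name.split()})
    index_of = {w: k for k, w in enumerate(vocab)}
    indices = [[index_of[w] for w in name.split()] for name in concept_names]
    # Lines 1-5: similarity of every query word to every vocabulary word.
    sim = [[word_similarity(q, w) for w in vocab] for q in query_term.split()]
    # Lines 6-8: one concept score per concept, reusing the matrix.
    scored = [(concept_score(sim, idx), name)
              for name, idx in zip(concept_names, indices)]
    # Line 9: return the best-scoring concepts first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


toy_metathesaurus = ["gastro oesophageal reflux", "gastric ulcer", "renal reflux"]
results = ldpmap_query("gastro reflux", toy_metathesaurus)
```

In this toy setting the top-ranked concept is "gastro oesophageal reflux" with a normalized score of 0.8, since both query words match exactly and only the length normalization lowers the score.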

Complexity  Analysis 

The LDPMap method is much faster than the classic LCS-based word similarity
calculation, which treats the query term and each UMLS concept as one single word, as
demonstrated in our empirical study. The classic LCS-based word similarity
calculation uses dynamic programming on a character basis, while we use two layers
of dynamic programming, one on a character basis and the other on a word basis. To
understand the analytical reason behind this speedup, let us make some simple
assumptions. Assume the UMLS Metathesaurus contains M unique concepts,
each concept or query term contains t words, and each word has d characters. Also
assume the UMLS Metathesaurus contains K unique words. Then, the classic LCS-based
word similarity calculation takes approximately $O(t^2 d^2 M)$ time to handle a query,
whereas the LDPMap method takes approximately $O(t d^2 K + t^2 M)$ time to handle this
query. It is easy to observe that $K \ll tM$. This explains why LDPMap is much more
efficient. In the following, we will see that our LDPMap approach can be further sped
up with the pipeline technique.
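To give a rough feel for the gap, the two cost models can be evaluated numerically. This is a back-of-the-envelope sketch that reads the two bounds as t²d²M for the classic method and td²K + t²M for LDPMap, which is consistent with the t = 1 case discussed in the query speed comparison. M and K are the record and unique-word counts reported above; t = 4 and d = 8 are assumed values chosen only for illustration, and the counts ignore constant factors.

```python
def operation_counts(t, d, m, k):
    """Approximate DP cell updates per query under the simplified cost model.

    t: words per term, d: characters per word,
    m: number of concepts, k: number of unique words.
    """
    classic = t * t * d * d * m            # whole-string LCS against every concept
    ldpmap = t * d * d * k + t * t * m     # word matrix plus word-level DP
    return classic, ldpmap


classic_ops, ldpmap_ops = operation_counts(t=4, d=8, m=6_870_000, k=1_874_573)
ratio = classic_ops / ldpmap_ops
```

Under these assumed values the classic method needs roughly an order of magnitude more cell updates than LDPMap, and the gap widens as t grows because the classic cost scales with d² on every concept.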

Speeding  up  LDPMap  with  the  Pipeline  Technique 

In building the WordSimilarityMatrix and ConceptScore_Vector, the dynamic
programming method is used around 1.87 million times and 6.87 million
times, respectively. It is interesting to find out whether there are repeated calculations that
can be reused to speed up the LDPMap method. By studying both the word vector and
the Metathesaurus, we found that the former has a lot of repeated prefixes among words
(e.g., the words "4-Aminophenol" and "4-Aminophenyl"), and the latter has a lot of repeated
prefix words among concepts (e.g., C1931062 ectomycorrhizal fungal sp. AR-Ny3,
C1931063 ectomycorrhizal fungal sp. AR-Ny2). Thus, by lexicographically sorting
the word vector and the Metathesaurus, we can use this information to save a lot of
calculation in the LDPMap approach as follows:

(1) In calculating WordSimilarityMatrix, given a word A, if it has p common prefix
letters with the previous word B, the dynamic programming only needs to start from
iteration p+1, because the previous p+1 columns of the dynamic programming table
are exactly the same as the previous results.

(2) In calculating ConceptSimilarityScore, given a concept α, if it has q common
prefix words with the previous concept β, the dynamic programming only needs to
start from iteration q+1, because the previous q+1 columns of the dynamic
programming table are exactly the same as the previous results. That means the for
loop in Line 2 of Function ConceptSimilarityScore shall start with j = q+2.

The mechanism of the speedup technique can be described as a pipeline technique
because a computation result can be passed down and partially reused by the
subsequent computation. In the empirical study, we will see that the pipeline
technique significantly improves the LDPMap speed.
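The prefix reuse in step (1) can be sketched as follows. This is an illustrative version, assuming the DP table is stored with one row per character of the vocabulary word, so that the p+1 reusable "columns" of the text correspond to rows 0..p here; all function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def extend_lcs_rows(query, word, rows, start):
    """Recompute LCS rows start+1..len(word), keeping rows[0..start] as-is."""
    del rows[start + 1:]
    for i in range(start + 1, len(word) + 1):
        prev, curr = rows[i - 1], [0]
        for j, ch in enumerate(query, start=1):
            if word[i - 1] == ch:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        rows.append(curr)
    return rows[len(word)][-1]


def pipelined_lcs_lengths(query, words):
    """LCS length of query against each word, visiting words in sorted order."""
    rows = [[0] * (len(query) + 1)]   # row 0: empty prefix of any word
    previous, lengths = "", {}
    for word in sorted(words):
        p = common_prefix_len(previous, word)   # rows 0..p are still valid
        lengths[word] = extend_lcs_rows(query, word, rows, p)
        previous = word
    return lengths


lengths = pipelined_lcs_lengths(
    "aminophenol", ["phenol", "4-Aminophenyl", "aminophenol", "4-Aminophenol"]
)
```

After "4-Aminophenol" has been processed, only the last two rows are recomputed for "4-Aminophenyl", because the two words share an 11-character prefix. The same idea applies one level up in step (2), with words taking the place of characters.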

A Comprehensive Query Workflow Using the LDPMap Approach

Given the above solutions to the concept similarity problem under Constraints 1 and
2, we design a comprehensive query workflow for mapping a query term to
UMLS concepts. Our query workflow needs to consider multiple types of input
variations and errors. Other than missing words and words in different orders, which can
be properly handled by the concept similarity problem formulation, we need to consider
another situation, in which two words are merged together. In this situation, the concept
similarity model does not fit well because it operates on a word basis. Therefore it is
preferable to use the classic LCS method. However, as we pointed out above, the
classic LCS method is too slow for the UMLS Metathesaurus. Fortunately, we found
that we can leverage the concept similarity solutions by outputting a list of concepts with
similarity scores greater than a threshold. When we set the threshold to 0.35, in most
cases it is able to output concepts that are similar to the query term regardless of the
word merging issues. The number of outputted concepts is much smaller than the size
of the UMLS Metathesaurus; thus applying the LCS method on this small subset is much
faster than on the whole UMLS Metathesaurus. The query workflow is illustrated in
Figure 1.

In the query workflow, we first calculate concept similarity scores under Constraint 2
between the query term and all UMLS concepts. If there are concepts with scores
higher than threshold T1, we output the results and the query completes. Otherwise,
we save any concepts with scores higher than threshold T2 as SET(T2), and then
perform two additional queries: (1) calculate word similarity between the query term
and each concept in SET(T2) by treating the query term and each concept as one single
word; (2) calculate the concept similarity scores under Constraint 1 between the query
term and all UMLS concepts. Finally, we merge and output the results from (1) and
(2). The number of results outputted is adjustable. An application can choose to
output concepts with scores higher than a threshold, or only the top ranked concepts.
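The control flow of this workflow can be sketched as below. This is a hedged illustration rather than the authors' implementation: the three scoring functions are passed in as parameters, the names `query_workflow` and `rank` are hypothetical, and merging the two result sets by keeping the higher score per concept is an assumption, since the text does not specify how the merge is performed.

```python
def rank(scores, top_k):
    """Return the top_k (concept, score) pairs, best score first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


def query_workflow(query, concepts, ordered_score, unordered_score,
                   string_score, t1=0.8, t2=0.35, top_k=5):
    """Two-stage query: Constraint 2 first, then fallbacks if nothing exceeds t1."""
    # Stage 1: concept similarity under Constraint 2 against all concepts.
    ordered = {c: ordered_score(query, c) for c in concepts}
    hits = {c: s for c, s in ordered.items() if s > t1}
    if hits:
        return rank(hits, top_k)

    # Stage 2 (1): whole-string similarity, restricted to SET(T2).
    merged = {c: string_score(query, c)
              for c, s in ordered.items() if s > t2}
    # Stage 2 (2): concept similarity under Constraint 1 against all concepts.
    for c in concepts:
        # Assumed merge rule: keep the higher of the two scores.
        merged[c] = max(merged.get(c, 0.0), unordered_score(query, c))
    return rank(merged, top_k)
```

Any scorers with the signature `score(query, concept) -> float` can be plugged in, for example normalized versions of the word-level and character-level similarity measures defined earlier.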


Results 

To understand the actual performance of LDPMap, we implemented it in C++, and
subjected it to two sets of empirical studies. In summary, the results demonstrate that
the LDPMap method performs much better than available methods in terms of query
speed and effectiveness. All experiments were carried out on Linux cluster nodes with
2.4GHz AMD Opteron processors. For the LDPMap query workflow, we set two
parameters T1 = 0.8 and T2 = 0.35.

Query  Speed  Comparison 

We would like to know how fast LDPMap handles queries in comparison with the
standard LCS method, which treats the query term and each UMLS concept as a single
word, and how effective the pipeline technique is for LDPMap. Therefore, we test
three algorithms, standard LCS, LDPMap (LDPMap Query Algorithm) without
the pipeline technique, and LDPMap with the pipeline technique, on four
sets of medical concepts randomly chosen from the UMLS Metathesaurus. The first
set consists of 1000 single-word medical concepts. The second, third and fourth sets
consist of 1000 two-word, 1000 three-word, and 1000 four-word concepts,
respectively. The results are shown in Figure 2.

From Figure 2 we can observe that the LDPMap algorithm is much faster than the
standard LCS. In addition, the query time of the standard LCS method is sensitive to the number
of words in a query term, while the LDPMap method is much more stable. This result
is consistent with the above complexity analysis. In addition, the
pipeline technique significantly speeds up the basic LDPMap method. This confirms
our intuition that the pipeline technique saves huge amounts of redundant
computation, thus improving the efficiency of the LDPMap method. As a result, we
can see that in this set of experiments LDPMap with the pipeline technique on average
answers a query in less than 1 second. However, the standard LCS method takes about
1 to 2 minutes to answer a query, which makes it virtually unacceptable for many
biomedical applications that require near real-time responses or that
process large amounts of data. In addition to the slow query time, the standard
LCS is not good at processing query terms with missing words or words in different
orders, as we have discussed above.

It is worthwhile to note that even for a one-word query, the LDPMap method is
significantly faster than LCS, though the concept similarity is exactly the same as the
word similarity in this case. This is because LDPMap pre-processes the UMLS
terms on a word basis and builds an efficient index. The similarity measurement is not
performed directly on the UMLS terms but on words and the index, which saves a lot of
computational cost. In contrast, LCS performs the similarity measurement
directly over every UMLS term. This can also be explained by our complexity
analysis above. When t = 1 (t is the number of words in a query), the LCS complexity is
$O(d^2 M)$ while that of LDPMap is $O(d^2 K + M)$. Since $K \ll M$, we conclude that LDPMap is
much faster than LCS.

Next, we would like to know how effectively LDPMap handles queries, especially
when the query terms are slightly different from the terms in the UMLS
Metathesaurus.

Query  Effectiveness  Comparison 

To understand how effectively LDPMap (referring to the LDPMap query workflow in this
set of experiments) handles queries with name variations and errors, we used two
available methods, the UMLS Metathesaurus Browser and MetaMap, as benchmarks. In a
cursory examination of cTAKES, we found that it exhibited similar characteristics to
MetaMap in its ability to handle name variations and errors, and therefore we
excluded it from the comparison. Since the study on the UMLS Metathesaurus Browser
requires manually inputting terms and checking the results, we have to limit the query
test to manageable numbers. In addition, since the UMLS Metathesaurus Browser
cannot accept a query term with more than 75 characters, we limit all query terms in
our test to no more than 75 characters. Given the above situations, and considering
the fact that more than 50% of UMLS concepts contain at least 32 characters, we
randomly chose 100 medical concepts with 32-75 characters from the UMLS
Metathesaurus.

The 100 medical concepts are divided into two groups. The first group consists of 50
concepts with no special characters (i.e., characters other than letters and numbers),
and the second group contains 50 concepts with 5 or more special characters. The two
groups serve two different testing purposes.

Group 1: We use group 1 to test how effectively the query workflow handles pure
English name terms, and English name terms with input errors, variations, and typos.
Thus, in addition to querying the original names, we also query the names with 1, 2,
3, and 4 character variations. Character variations are generated randomly in this
study and include (1) deleting a character, (2) replacing a character, and (3) merging
two words, i.e., deleting the white space between two words.
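
The three Group 1 operations can be sketched as follows. This is an illustrative sketch, not the generator used in the study: the function names, the seeding scheme, the lower-case replacement alphabet, and the choice to restrict deletions to non-space characters are assumptions.

```python
import random


def apply_variation(term: str, rng: random.Random) -> str:
    """Apply one random edit: delete a character, replace a character, or merge two words."""
    ops = ["delete", "replace"]
    if " " in term:
        ops.append("merge")
    op = rng.choice(ops)
    if op == "merge":
        # Merge two words by deleting one randomly chosen white space.
        spaces = [i for i, ch in enumerate(term) if ch == " "]
        pos = rng.choice(spaces)
        return term[:pos] + term[pos + 1:]
    # Only touch non-space characters so "delete" never doubles as "merge".
    positions = [i for i, ch in enumerate(term) if ch != " "]
    pos = rng.choice(positions)
    if op == "delete":
        return term[:pos] + term[pos + 1:]
    # Replace with a different random lower-case letter (assumed alphabet).
    choices = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != term[pos]]
    return term[:pos] + rng.choice(choices) + term[pos + 1:]


def vary(term: str, n: int, seed: int = 0) -> str:
    """Apply n successive random variations to term."""
    rng = random.Random(seed)
    for _ in range(n):
        term = apply_variation(term, rng)
    return term


print(vary("Distal radioulnar joint", 2))
```

Because edits are applied one after another, a later edit can land on a character changed by an earlier one; whether the study prevented this is not stated.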

Group 2: We use group 2 to test how effectively the query algorithm handles
professional medical terms that may contain many special characters, such as the
names of chemical compounds and drugs. To simulate the name variations that
frequently appear in these terms, we randomly apply 1, 2, 3, and 4 character
variations, including (1) deleting a special character and (2) replacing a special
character with a white space.
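
The Group 2 variations can be sketched in the same spirit. Again this is a hedged illustration rather than the study's code: `is_special` and `vary_special` are hypothetical names, and treating white space as a non-special character is an assumption.

```python
import random


def is_special(ch: str) -> bool:
    """Special character: neither a letter nor a number.

    Treating white space as non-special is an assumption of this sketch.
    """
    return not ch.isalnum() and not ch.isspace()


def vary_special(term: str, n: int, seed: int = 0) -> str:
    """Apply up to n random edits, each deleting a special character
    or replacing it with a white space."""
    rng = random.Random(seed)
    for _ in range(n):
        positions = [i for i, ch in enumerate(term) if is_special(ch)]
        if not positions:
            break  # no special characters left to vary
        pos = rng.choice(positions)
        filler = rng.choice(["", " "])  # "" deletes, " " replaces
        term = term[:pos] + filler + term[pos + 1:]
    return term


name = "danthron 1.5 MG/ML / Pantothenic Acid 2.5 MG/ML Oral Suspension"
print(vary_special(name, 4))
```

Letters and digits are never touched, so every variation stays within the two operations listed for this group.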


To complement the above test groups, we use the following group to test how
effectively the query algorithm handles short terms, which are likely to be queried
commonly in real situations.

Group 3: We randomly picked 100 medical concepts with 5-31 characters. Since
many of these concepts are quite short, we apply only 1 and 2 random character
variations, including (1) deleting a character, (2) replacing a character, and
(3) merging two words.

In these experiments, we found that MetaMap often outputs multiple matching results
without ranking them. In contrast, the UMLS Metathesaurus Browser usually outputs a
list of ranked concepts, and LDPMap can be configured to output the top k (k >= 1)
ranked concepts.

Thus, to be as fair as possible, we use two criteria to measure the correctness of a
query:

Criterion 1: A query is correct if the original term appears (1) in the top 25 ranked
concepts (i.e., on the first page of results) returned by the UMLS Metathesaurus
Browser; (2) in the top 25 ranked concepts returned by LDPMap; (3) in the result of
MetaMap.

Criterion 2: A query is correct if the original term appears (1) as the top-ranked
concept returned by the UMLS Metathesaurus Browser; (2) as the top-ranked concept
returned by LDPMap.
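
The two criteria translate directly into a small scoring routine. The sketch below is illustrative only: the function names, the `top_k` parameter, and the placeholder CUIs are assumptions, and passing `top_k=None` is one way to model an unranked tool such as MetaMap, for which Criterion 1 reduces to membership in the whole result.

```python
def is_correct(original_cui, ranked_cuis, criterion, top_k=25):
    """Return True if the original concept is found under the given criterion.

    Criterion 1: the original appears among the first top_k results
    (top_k=None checks the whole, unranked result).
    Criterion 2: the original is the top-ranked result.
    """
    if criterion == 1:
        return original_cui in ranked_cuis[:top_k]
    if criterion == 2:
        return bool(ranked_cuis) and ranked_cuis[0] == original_cui
    raise ValueError("criterion must be 1 or 2")


def error_rate(results, criterion, top_k=25):
    """Fraction of queries judged incorrect; results holds (original, ranked) pairs."""
    wrong = sum(not is_correct(o, r, criterion, top_k) for o, r in results)
    return wrong / len(results)


# Hypothetical outputs for three queries (the CUIs are placeholders).
results = [
    ("C001", ["C001", "C002"]),  # top-ranked hit
    ("C003", ["C004", "C003"]),  # found, but not ranked first
    ("C005", []),                # no result returned
]
print(error_rate(results, 1), error_rate(results, 2))
```

On this toy input one query in three fails Criterion 1 and two in three fail Criterion 2, which reflects why Criterion 2 is the stricter measure.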

Criterion 1 indicates whether the query processing mechanism is able to handle the
query with reasonable accuracy. Criterion 2 is much more stringent and indicates
whether a method can be applied to applications that require high accuracy.

Figures 3 and 4 show the error rates for Groups 1 and 2 under Criterion 1. Both
figures show clearly that the LDPMap approach has very few errors across all tests.
In comparison, the error rates of the UMLS Metathesaurus Browser and MetaMap are
quite high, especially when multiple character changes are present. MetaMap has a
considerable error rate even when querying the original terms (0 character changes).
This may be due to the text processing mechanism of MetaMap. Since MetaMap is
targeted at finding medical terms in biomedical text, it leverages a combination of
part-of-speech tagging, shallow parsing, and longest spanning match against terms
from the SPECIALIST Lexicon before matching terms against concepts in the UMLS.
Therefore, it tends to decompose longer spans of text and medical terms into several
shorter medical terms.

Figures 5 and 6 show the error rates for Groups 1 and 2 under Criterion 2. Since
MetaMap usually outputs multiple concepts without ranking, we exclude MetaMap from
the Criterion 2 measurement. From these two figures, we can observe that the error
rate of the UMLS Metathesaurus Browser is much higher than under Criterion 1. Quite
surprisingly, there are some errors even when querying a few original terms (such as
"Distal radioulnar joint"). This suggests that the UMLS Metathesaurus Browser is not
suitable for query processing in applications that demand high accuracy. In contrast,
LDPMap still has a very low error rate, on average less than 5% across the 0-5
character changes, and is free of errors when querying the original terms.

From Figures 7 and 8, we can see that the general performance of LDPMap, the UMLS
Metathesaurus Browser, and MetaMap on short query terms is similar to their
performance on long query terms. LDPMap still has a clear advantage over the UMLS
Metathesaurus Browser and MetaMap. However, we noticed that the LDPMap error rate
reaches 27% for 2 character changes under Criterion 2. This is understandable
because short terms generally contain fewer words than long terms, which puts the
concept similarity measurement at a disadvantage. However, the parameter T1 can be
used to adjust the preference between the concept similarity measurement and the
word similarity measurement. By increasing T1 from 0.8 to 0.85, we observed that
this error rate falls from 27% to 20%. This demonstrates that LDPMap is flexible in
handling both long and short term queries.

To provide some details on the medical concepts used in this set of experiments and
the character changes applied, we list a few of them in Table 3. The table contains
concepts of different lengths. The randomly generated character variations cover
several common cases of text data inaccuracy, including misspellings, merging of two
words, and special character omissions. From Table 4 we can see that MetaMap cannot
handle them properly. Instead, it finds some concepts related to individual words in
the query term. The UMLS Metathesaurus Browser does not do any better on them. In
contrast, LDPMap correctly answered all these queries except for "Albunexlectable
Product". Although the returned concept "Injectable Product" is not correct, it is at
least closer to the original term than those returned by the UMLS Metathesaurus
Browser and MetaMap. By reviewing the LDPMap approach, we conclude that this error
can be eliminated if we increase the threshold T1 to a value such that word
similarity (LCS) is used to measure the two terms. To confirm this, we increased T1
from 0.8 to 0.85, and LDPMap successfully returned the original term. However, a high
T1 implies that LDPMap gives more preference to the LCS-based similarity measurement
than to the concept similarity measurement defined above. Consequently, LDPMap will
be less effective in handling real-world queries that contain incomplete medical
terms (i.e., medical terms with missing words). It is quite evident that no single
setting of T1 and the workflow's other parameters fits all situations. As a result,
we will fine-tune these parameters when applying LDPMap in our future applications.
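
The trade-off controlled by T1 can be pictured with a toy decision rule. This sketch is an assumption-laden illustration and not the LDPMap workflow itself (that workflow is the one shown in Figure 1): the cut-off rule, the function name `pick_candidate`, and all scores below are invented for illustration and are not measurements from the experiments.

```python
def pick_candidate(candidates, t1=0.8):
    """Pick one candidate from (name, concept_score, lcs_score) tuples.

    Assumed rule (not taken from the source): if the best concept-level
    score reaches t1, trust it; otherwise rank by the LCS-based score.
    """
    best_by_concept = max(candidates, key=lambda c: c[1])
    if best_by_concept[1] >= t1:
        return best_by_concept[0]
    return max(candidates, key=lambda c: c[2])[0]


# Illustrative scores only; they are NOT values from the experiments.
candidates = [
    ("Injectable Product", 0.82, 0.70),
    ("Albunex Injectable Product", 0.60, 0.90),
]
print(pick_candidate(candidates, t1=0.80))  # Injectable Product
print(pick_candidate(candidates, t1=0.85))  # Albunex Injectable Product
```

Under this assumed rule, raising T1 sends more queries to the LCS-based measure, which helps with merged words but would hurt queries that omit whole words.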


Conclusions 

In this work we proposed LDPMap, a layered dynamic programming approach to
efficiently mapping inaccurate medical terms to UMLS concepts. The main advantage
of the LDPMap algorithm is that it runs much faster than the classical LCS method,
which makes it possible to handle UMLS term queries efficiently. When similarity is
counted on a word basis, the LDPMap algorithm may yield a more desirable result than
LCS. In other cases (such as word merging), LCS query results may be preferable.
Thus, in the comprehensive query workflow of LDPMap, the LDPMap method is
complemented by LCS and is adjustable through the parameter T1. Unlike using LCS
alone, the LDPMap query workflow applies LCS (when needed) only to a very limited
number of candidate terms and thus achieves a very fast query speed.

In the query effectiveness comparison, we observed that LDPMap has very high
accuracy in processing queries over the UMLS Metathesaurus that involve inaccurate
terms. In contrast, the UMLS Metathesaurus Browser has very limited ability to
handle these queries, though it handles queries of accurate terms fairly well.
Throughout the study, we also observed that MetaMap, in general, is not suitable for
mapping long medical terms to UMLS concepts, as it focuses on extracting short
medical terms from the query text.

Although LDPMap is very efficient in handling UMLS term queries, it has two major
limitations. First, it cannot handle synonyms and coreferences. Fortunately, the
UMLS Metathesaurus often lists a concept's preferred names and synonyms, so LDPMap
can work effectively in most cases, though the list may still be incomplete. Second,
it is not able to perform syntax-level processing as MetaMap does, such as
extracting medical terms from an article. Whether it is possible to extend the
LDPMap approach to overcome these two limitations remains an open question. In the
future we would like to investigate this question, and we plan to use LDPMap as an
efficient pre-processing tool to map medical terms to UMLS concepts and to use the
results in our knowledge discovery platform.

Figures

Figure  1 

A  Comprehensive  Query  Workflow  Using  LDPMap 


Figure  2 

Query time of LCS, LDPMap, and the LDPMap pipeline on 1000 randomly chosen
medical concepts.


Figure  3 

Correctness  comparison  on  LDPMap,  UMLS  Metathesaurus  Browser,  and  MetaMap 
for Group 1 using Criterion 1.


Figure  4 

Correctness  comparison  on  LDPMap,  UMLS  Metathesaurus  Browser,  and  MetaMap 
for Group 2 using Criterion 1.


Figure  5 

Correctness  comparison  on  LDPMap  and  UMLS  Metathesaurus  Browser  for  Group  1 
using  Criterion  2. 


Figure  6 

Correctness  comparison  on  LDPMap  and  UMLS  Metathesaurus  Browser  for  Group  2 
using  Criterion  2. 


Figure  7 

Correctness  comparison  on  LDPMap,  UMLS  Metathesaurus  Browser,  and  MetaMap 
for Group 3 using Criterion 1.


Figure  8 

Correctness  comparison  on  LDPMap  and  UMLS  Metathesaurus  Browser  for  Group  3 
using  Criterion  2. 


Tables 

Table  1. 

An example of WordSimilarity_Matrix constructed for query term "gastro reflux".


Table  2 

An example of calculating the concept similarity score between the query term
"gastro reflux" and the UMLS concept "gastro oesophageal reflux" for the
ConceptScore_Vector construction. The calculation refers to the
WordSimilarity_Matrix shown in Table 1. The normalized final similarity score is
2*2/(2+3)=0.8.


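| UMLS concept |       |   | gastro | oesophageal | reflux |
|--------------|-------|---|--------|-------------|--------|
| word index   |       |   | i      | k           | j      |
| query term   | order |   |        |             |        |
|              |       |   | 0      | 0           | 0      |
| gastro       | 1     | 0 | 1      | 1           | 1      |
| reflux       | 2     | 0 | 1      | 1.23594     | 2      |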
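
The normalization quoted in the Table 2 caption can be checked with a one-line function. This is a sketch of one reading of that arithmetic, assuming the numerator's second factor is the raw score 2 from the last cell of the table and the denominator is the sum of the word counts of the query (2) and the concept (3); the function and parameter names are hypothetical.

```python
def normalized_similarity(raw_score: float, query_words: int, concept_words: int) -> float:
    """Normalize a raw alignment score by the combined word counts of both terms."""
    return 2 * raw_score / (query_words + concept_words)


# "gastro reflux" (2 words) against "gastro oesophageal reflux" (3 words), raw score 2.
print(normalized_similarity(2, 2, 3))  # 0.8
```

Under this reading a perfect match between two terms of equal length scores 1, and missing words lower the score in proportion to the length difference.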

Table  3 

Original terms and their randomly generated character variations

| CUI | Name | Randomly generated 4 character variations |
|-----|------|-------------------------------------------|
| C3267394 | POMEGRANATE FRUIT EXTRACT 150 MG Oral Capsule | POMGRAATE FRUIT EXTRdCT 150 MG Oral Casule |
| C3228202 | Albunex Injectable Product | Albunexlectable Product |
| C0505183 | Lateral branch of dorsal ramus of fifth thoracic spinal nerve | LateMa branch of dorsal ramus of ifth thoracic gpinal nerve |
| C1459293 | Sinorhizobium americanus | Sinokhizrbimamericanus |
| C1541607 | gp100/IL-7/ISA-51/MART-1 | gp100 IL 7ISA-51/MART1 |
| C1352046 | danthron 1.5 MG/ML / Pantothenic Acid 2.5 MG/ML Oral Suspension | danthron 15 MGML Pantothenic Acid 25 MG/ML Oral Suspension |
| C0040372 | Benzenesulfonamide, N-(((hexahydro-1H-azepin-1-yl)amino)carbonyl)-4-methyl- | Benzenesulfonamide, N-(( hexahydro1H-azepin-1-yl amino)carbonyl-4-methyl- |
| C2714409 | 1-undecene-1-O-beta-2',3',4',6'-tetraacetyl glucopyranoside | 1-undecene 1-O-beta2,3',4',6-tetraacetyl glucopyranoside |

Table  4 

Query  results  for  Table  3. 


| CUI | UMLS Metathesaurus Browser (concept ranked 1st by approximate match) | MetaMap | LDPMap |
|-----|------|------|------|
| C3267394 | C0030054 Oxygen | C0016767 Fruit, C2346927 Mg++, and 4 others | correct |
| C3228202 | C1514468 product | C1704444 Product (Multiplicative Product) [Quantitative Concept], C1514468 product [Entity] | C0086466 Injectable Product |
| C0505183 | C0007965 Chediak-Higashi Syndrome | C1706131 Branch (Branch (group)), C2700383 Branch (Branch of plant), and 6 others | correct |
| C1459293 | No result | No result | correct |
| C1541607 | C1512807 Integrated Learning System | C0020898 IL (Illinois (geographic location)), C1522481 MART-1 (MART-1 Tumor Antigen), and 2 others | correct |
| C1352046 | C0029383 Osmium | C1129294 danthron 25 MG, C0439526 /mL [Quantitative Concept], and 3 others | correct |
| C0040372 | C0265215 Meckel-Gruber syndrome | C0053169 benzenesulfonamide, C0441922 N-l- (N-l- (tumor staging)), and two others | correct |
| C2714409 | C0030011 Oxidation | C0470206 +1 [Quantitative Concept], C1417683 BETA2 (NEUROD1 gene), and 7 others | correct |

